1. Introduction
The European Union Emissions Trading System (EU ETS) has served as a cornerstone of Europe’s climate change mitigation strategy within the Fit for 55 package and represents the largest and most established carbon market globally [
1,
2,
3]. The price of carbon allowances (EU allowances, EUA) directly influences operational and strategic decisions across industries, affecting production costs, investment in low-carbon technologies, and overall competitiveness [
4,
5]. Therefore, accurately forecasting the EUA prices is critical for policymakers aiming to design effective climate policies, investors managing risks and returns, and market participants navigating the complexities of compliance and competitiveness in a carbon-constrained economy [
6,
7].
Forecasting carbon prices in the EU ETS, however, poses a unique set of challenges [
8]. Carbon markets are influenced by a complex interplay of economic indicators, policy developments, energy market dynamics, and sentiment-related factors, leading to high volatility and uncertainty in price movements [
9,
10,
11]. Traditional econometric forecasting methods often fall short in capturing these dynamic interactions and the nonstationary nature of the carbon price time series [
12,
13,
14]. Recent approaches have employed deep learning techniques to improve forecasting accuracy [
15,
16,
17]. One prevalent paradigm involves ensemble models combined with optimization algorithms, integrating multiple forecasting models to capture different aspects of the data [
18,
19,
20,
21,
22,
23]. Another is based on the “divide and conquer” principle [
24,
25], utilizing decomposition-ensemble methods where the original time series is decomposed into components such as trend, seasonality, and noise, each modeled separately before integration [
26,
27,
28,
29,
30,
31,
32,
33]. While these methods have considerably advanced point forecasting, the increasingly complex climate and economic policy environment necessitates more comprehensive predictive information to capture the uncertainties of volatile markets like the EU ETS [
34,
35]. Point forecasts provide valuable insights into expected future prices; however, they may not fully capture the range of possible variation around the expected values [
36,
37,
38,
39,
40,
41]. Therefore, additional forecasting approaches such as interval-valued forecasting are needed to provide deeper insights and enhance decision-making in managing exposure to risk [
42].
Interval-valued forecasting offers a more comprehensive approach by predicting a range of potential prices, represented by upper and lower bounds [
37,
43]. The methodology provides valuable measures of market uncertainty and risk, offering deeper insights into the temporal dynamics affecting carbon prices [
44,
45]. For example, accurately assessing the trends of high and low price bounds enables market participants to better estimate market volatility and develop robust hedging strategies [
46]. Policymakers can also benefit from interval forecasts by understanding the potential impacts of policy interventions and regulatory changes on the carbon trading market [
47]. While interval-valued forecasting has gained traction, most existing research has focused on China’s pilot carbon markets, mainly employing interval decomposition-ensemble paradigms [
44,
48,
49,
50,
51,
52]. Research on interval carbon price forecasting for the EU ETS remains scarce. Only a few studies, such as those by Zhu et al. [
40], Tian and Hao [
53], Niu et al. [
54], Wang et al. [
55], Zhao et al. [
56], Yang et al. [
57], and a series of studies on open-high-low-close prices by Huang et al. [
58,
59,
60], have explored this area, indicating a large gap in the development of effective forecasting frameworks specifically tailored to the EU ETS. Given its unique characteristics and global significance, advancing interval forecasting methodologies for this carbon market has become both necessary and timely. Moreover, in recent years, the integration of big data sources such as online news and search trends has opened new dimensions in predictive analytics [
61,
62,
63,
64]. In energy and financial markets, incorporating news sentiment and search indices into forecasting models has been shown to enhance the predictive performance of deep learning models [
38,
39]. Nevertheless, the use of such big data sources in interval-valued carbon price forecasting is still at an early stage [
65,
66,
67].
Recent advancements in machine learning and big data analytics have provided new opportunities to address the interval-valued forecasting challenges [
68]. For instance, interval decomposition-ensemble methods typically employ rolling decomposition techniques to separate the original time series into components such as trend, seasonality, and noise [
36,
44,
55,
57]. The rolling procedure aims to reduce potential data leakage and adapt to structural changes in the series, with each component subsequently modeled by a specialized algorithm before the component forecasts are recombined. By isolating distinct temporal patterns, these methods provide fine-grained insights into the factors driving fluctuations in interval-valued carbon prices, which can be useful in contexts where policy adjustments and market conditions evolve over time. In addition, ensemble optimization algorithms integrate multiple predictive modules, each capturing specific aspects of the data, to form a collective forecast [
40,
45,
48,
53,
54]. Through iterative optimization procedures, the ensembles seek to balance and synthesize diverse viewpoints, allowing for robust forecasting under various market scenarios. In contrast to the above methods, which partition time series signals or integrate multiple standalone forecasts, the multi-task learning (MTL) perspective approaches interval prediction by simultaneously modeling the upper and lower bounds within one unified learning framework. MTL frameworks and deep learning models like Transformers have demonstrated promise in handling large datasets and the complex temporal dependencies common in nonlinear and nonstationary data [
69], and MTL offers the advantage of learning shared representations across related tasks, potentially improving performance compared to learning each task independently [
70]. MTL can further accommodate diverse-source data, offering an alternative route to capturing the uncertainties and complexities of the interval-valued carbon price dynamics. Successful applications of MTL in energy and financial markets have been reported, such as in electricity load forecasting [
71,
72,
73] and stock price prediction [
74,
75], where MTL models outperformed single-task approaches. Despite these advancements, applying MTL to interval-valued carbon price forecasting remains underexplored.
In this study, we propose a novel multi-task learning framework that integrates multi-source heterogeneous data (online news, Google Trends, and related market futures prices) using a Transformer-based model (Temporal Fusion Transformer, TFT) [
76] for interval-valued EUA futures price forecasting. By incorporating online news sentiment intensity and the Internet search index into the framework, we aim to enhance the predictive capability for interval-valued carbon prices. The framework’s effectiveness is evaluated through comprehensive experiments, including model comparisons and ablation studies. We also conduct systematic robustness tests under different market uncertainty conditions to assess the model’s stability and reliability. Furthermore, the time-varying interpretability provided by our framework offers practical insights into the variable importance and temporal patterns that affect carbon price forecasting.
This paper contributes to the literature and practice in two respects. First, while previous studies on interval-valued carbon price forecasting in the EU ETS have mainly relied on decomposition-ensemble and hybrid modeling methodologies, this study introduces a novel approach that combines multi-task learning with diverse-source data, offering a fresh perspective on interval-valued forecasting in carbon markets. Second, through the Transformer-based model’s time-varying interpretability mechanism, we illustrate how previous trading days influence future interval carbon prices and how important the various input variables are to the predictions. This allows us to explain the predictive capacity of market factors and big data for interval carbon prices; that is, we find strong empirical evidence identifying the trading periods and characteristics that predict future carbon market interval prices, with meaningful practical implications. The proposed framework and empirical findings not only build trust in complex deep learning-based forecasting systems but also offer a robust tool for practical applications in carbon market risk management and environmental policy-making.
The remainder of the paper is organized as follows:
Section 2 describes the data sources and preprocessing steps in this study.
Section 3 analyzes the model performance, variable importance, and temporal patterns of the carbon price forecasting. Finally,
Section 4 concludes the paper with a summary of key findings and directions for future research.
2. Methodology
This section introduces the concepts of interval data and multi-task learning, describes data collection and preprocessing, and explains the model setup and evaluation metrics.
Figure 1 illustrates the proposed framework of this study.
2.1. Interval-Valued Data
Interval-valued data represent a specific instance within symbolic data analysis (SDA). Compared with single-valued variables, they are better suited to depicting the complexities and fluctuations of real-world scenarios, as they encapsulate greater uncertainty and variability. In SDA, an interval-valued variable, denoted as $X$, is defined as a mapping from $\Omega$ into $\mathfrak{I}$. For each $\omega \in \Omega$, $X(\omega)$ corresponds to an interval $[a, b]$, where $\mathfrak{I}$ represents the set of closed intervals in $\mathbb{R}$.
At each discrete time point $t$ ($t = 1, 2, \ldots, n$), an interval is characterized as a two-dimensional vector $[X_t^L, X_t^U]$, where $X_t^L$ indicates the lower bound and $X_t^U$ signifies the upper bound of the interval, adhering to the condition $X_t^L \le X_t^U$. The sequence of intervals is then represented as follows:
$$\{X_t\}_{t=1}^{n} = \{X_1, X_2, \ldots, X_n\},$$
with $n$ expressing the total number of intervals. Specifically, the observed interval at time $t$ is denoted as $X_t$, expressed mathematically as follows:
$$X_t = [X_t^L, X_t^U].$$
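To make the notation concrete, the following minimal Python sketch stores an interval-valued series as pairs of lower and upper bounds and enforces the ordering condition between them. The class name `Interval`, the helper `build_interval_series`, and the sample prices are illustrative assumptions rather than part of the study.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Interval:
    """A closed interval [lower, upper] observed at one time point."""

    lower: float
    upper: float

    def __post_init__(self) -> None:
        # Enforce the ordering condition: lower bound must not exceed upper bound.
        if self.lower > self.upper:
            raise ValueError(f"lower bound {self.lower} exceeds upper bound {self.upper}")

    @property
    def center(self) -> float:
        """Midpoint of the interval."""
        return (self.lower + self.upper) / 2.0

    @property
    def radius(self) -> float:
        """Half-length of the interval."""
        return (self.upper - self.lower) / 2.0


def build_interval_series(lows: Sequence[float], highs: Sequence[float]) -> List[Interval]:
    """Pair daily low and high prices into a time-ordered list of intervals."""
    if len(lows) != len(highs):
        raise ValueError("lows and highs must have the same length")
    return [Interval(lo, hi) for lo, hi in zip(lows, highs)]


# Hypothetical daily low/high prices, for illustration only.
series = build_interval_series([80.0, 81.5, 79.0], [83.0, 84.5, 82.0])
```

The center and radius properties are included because interval error measures are often expressed in terms of these two quantities as well as the bounds themselves.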
2.2. Multi-Task Learning
Multi-task learning is a machine learning paradigm designed to harness the useful information present across multiple related tasks, thereby enhancing the generalization performance of all tasks involved. The definition of MTL is as follows: Given m learning tasks where all the tasks or a subset are interrelated, MTL aims to simultaneously learn these m tasks to improve the learning outcomes for each individual task by leveraging the knowledge derived from other tasks.
MTL enhances the learning efficiency of individual sub-tasks by leveraging shared information, ultimately leading to increased accuracy and operational efficiency.
Figure 2 shows the differences between the training methodologies of various learning approaches. The success of MTL in improving forecasting performance is attributable to both the model’s inherent capabilities and the strategic optimization of the employed loss functions.
In the context of deep learning, MTL is executed by deriving shared representations from multiple supervisory signals. Historically, deep multi-task architectures have been categorized into hard and soft parameter sharing techniques. In hard parameter sharing, the parameter set is divided into shared parameters and those specific to each task (see
Figure 2c). Models utilizing hard parameter sharing generally consist of a shared encoder that branches into task-specific heads. Conversely, soft parameter sharing assigns each task its own distinct set of parameters, facilitated by a mechanism that promotes feature sharing across tasks (see
Figure 2d). Moreover, there are also two model types: encoder-focused and decoder-focused architectures. Encoder-focused architectures (see
Figure 2e) restrict information sharing to the encoder phase, employing either hard or soft parameter sharing before each task is decoded using an independent, task-specific head. In contrast, decoder-focused architectures (see
Figure 2f) allow for information exchange during the decoding process as well. The proposed multi-task learning framework in this study is implemented using the TFT model, which aligns with the decoder-focused architecture of deep multi-task learning.
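As a rough illustration of hard parameter sharing, the sketch below passes an input through one shared encoder and then through two task-specific heads, one per interval bound, and sums the two task losses. It is a toy forward pass in plain Python, not the TFT; the class name, layer sizes, and random initialization are assumptions made only for this example.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class HardSharingModel:
    """Toy model: one shared encoder followed by two task-specific linear heads."""

    def __init__(self, n_inputs: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Shared parameters: every task reads the same encoder weights.
        self.w_shared: Matrix = [
            [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)
        ]
        # Task-specific parameters: one head per interval bound.
        self.head_low: Vector = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.head_high: Vector = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]

    def encode(self, x: Vector) -> Vector:
        """Shared representation used by both tasks."""
        return [math.tanh(dot(row, x)) for row in self.w_shared]

    def forward(self, x: Vector) -> Tuple[float, float]:
        """Return (lower-bound prediction, upper-bound prediction)."""
        hidden = self.encode(x)
        return dot(self.head_low, hidden), dot(self.head_high, hidden)


def joint_loss(pred: Tuple[float, float], target: Tuple[float, float]) -> float:
    """Sum of the squared errors of both tasks.

    Minimizing this single quantity updates the shared encoder using the
    error signal of both bounds at once.
    """
    return (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2


model = HardSharingModel(n_inputs=3, n_hidden=4)
low_pred, high_pred = model.forward([0.1, 0.2, 0.3])
```

In a soft-sharing design, by contrast, each task would own a full copy of the encoder weights and an additional term would encourage the copies to stay similar.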
2.3. Data Collection and Preprocessing
In this section, we discuss the selection, collection, and preprocessing of the diverse-source data.
2.3.1. Market Variables
The main variable to be predicted is the interval-valued price of the EUA futures contract expiring in December 2024, traded on the ICE CEX platform. The interval-valued EUA futures price data were sourced from the Investing database (
https://www.investing.com/), encompassing the period from 4 January 2021, to 23 February 2024 (covering EU ETS Phase 4). As illustrated in
Figure 3, the price data exhibit notable non-stationarity, non-linearity, and abrupt volatility, characteristics that necessitate careful modeling approaches.
In this study, to assess the predictive performance of our proposed forecasting framework, we use the daily returns of the interval-valued prices as forecasting targets; that is, the daily returns of the high and low prices are predicted simultaneously.
Table 1 provides a detailed description of the statistical characteristics of the return rates.
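The return calculation can be sketched as follows. The text does not state whether simple or logarithmic returns are used, so the sketch assumes simple returns, $(p_t - p_{t-1}) / p_{t-1}$; the function names and sample prices are hypothetical.

```python
from typing import List, Sequence, Tuple


def simple_returns(prices: Sequence[float]) -> List[float]:
    """Simple daily returns (assumed definition): (p_t - p_{t-1}) / p_{t-1}."""
    return [(curr - prev) / prev for prev, curr in zip(prices[:-1], prices[1:])]


def interval_returns(
    lows: Sequence[float], highs: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return one (low-price return, high-price return) pair per day after the first."""
    if len(lows) != len(highs):
        raise ValueError("lows and highs must have the same length")
    return list(zip(simple_returns(lows), simple_returns(highs)))


# Hypothetical low/high prices for three consecutive trading days.
pairs = interval_returns([100.0, 110.0, 99.0], [104.0, 117.0, 104.0])
```

Note that the return of the low price is not guaranteed to lie below the return of the high price, so each pair is best read as two jointly predicted targets rather than as an ordered interval.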
2.3.2. Related Market Variables
Previous studies have shown that fluctuations in energy prices, such as natural gas, electricity, coal, and crude oil, are closely related to changes in EU carbon prices when predicting the EU ETS prices [
7,
8,
18]. Therefore, our research incorporates four key daily-frequency energy prices from the energy sector: NBP natural gas futures, Brent crude oil futures, ICE Rotterdam coal futures, and German power base load futures. The STOXX Europe 600 index futures are also selected as a representative indicator of the current and expected economic conditions, reflecting the exogenous impact on carbon pricing. To ensure a comprehensive analysis, all daily frequency variables were collected from 4 January 2021, to 23 February 2024 (
www.investing.com). This study constructs the input matrix for model prediction based on the trading days of carbon prices. In cases where trading day prices for other relevant market variables are missing, we apply forward filling to impute missing values using the previous day’s prices for the respective variables [
6,
18,
77].
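The alignment and forward-filling step might look like the sketch below, which uses only the standard library. It assumes ISO-formatted date strings (which sort chronologically) and fills each missing carbon trading day with the most recent earlier price of the other market; the function name and sample values are illustrative.

```python
from typing import Dict, List, Optional


def forward_fill_on_calendar(
    carbon_days: List[str], other_prices: Dict[str, float]
) -> List[Optional[float]]:
    """Align another market's prices to the carbon trading days.

    A carbon trading day without a quote takes the most recent earlier quote.
    Days before the first available quote stay None.
    """
    available = sorted(other_prices)  # ISO dates sort chronologically
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    idx = 0
    for day in sorted(carbon_days):
        # Advance through all quotes dated on or before the current carbon day.
        while idx < len(available) and available[idx] <= day:
            last = other_prices[available[idx]]
            idx += 1
        filled.append(last)
    return filled


# Hypothetical example: the other market has no quote on 5 and 7 January.
days = ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]
gas = {"2021-01-04": 50.0, "2021-01-06": 52.0}
aligned = forward_fill_on_calendar(days, gas)
```

Because only past values are carried forward, this imputation does not leak future information into the model inputs.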
We conducted a Pearson correlation analysis to explore the relationships between the EU ETS allowance prices and related variables. The results as visualized in
Figure 4 provide further insights into the interdependencies among the selected features. The heatmap reveals a strong correlation between the high, low, close, and open prices of the allowances, with coefficients close to 1, reflecting their inherent interdependence. Among the energy market variables, Brent crude oil futures exhibit the highest correlation with the allowance prices (e.g., 0.58 with high price), followed by Rotterdam coal futures and German power baseload futures, while NBP natural gas futures show slightly weaker correlations. These findings align with prior studies suggesting that energy prices significantly influence EU ETS price dynamics due to their direct connection to emission-intensive production processes. The trading volume also shows moderate correlations with the allowance prices, indicating its role as a potential indicator of market activity and liquidity. On the other hand, the STOXX Europe 600 index futures exhibit relatively weak correlations with the allowance prices, suggesting that financial market conditions may play a more indirect role in shaping the EU ETS dynamics.
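For reference, the Pearson coefficient underlying such a heatmap can be computed as in the following sketch. The variable names and the toy series are hypothetical and do not reproduce the study's data.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(data: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations, keyed by variable name."""
    return {a: {b: pearson(data[a], data[b]) for b in data} for a in data}


# Toy series standing in for price columns (illustrative values only).
toy = {
    "eua_high": [80.0, 82.0, 85.0, 83.0],
    "eua_low": [78.0, 80.0, 82.0, 81.0],
    "brent": [70.0, 71.0, 74.0, 72.0],
}
matrix = correlation_matrix(toy)
```

Pearson correlation only measures linear association, which is one reason weak coefficients do not rule out a nonlinear or lagged influence on carbon prices.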
2.3.3. Unstructured Data and Search Index
The EU ETS, as a policy-driven artificial market, has experienced sharp price fluctuations due to policy changes, unexpected events, and public sentiment. Research has confirmed a direct relationship between media sentiments and fluctuations in the EUA prices, underscoring the influence of news articles on market dynamics [
62,
63,
65]. To ensure representative online news media data, 2750 articles were gathered from the EU ETS section of the Carbon Pulse website (
https://carbon-pulse.com/) between 4 January 2021, and 23 February 2024. The founders of Carbon Pulse possess nearly three decades of experience in carbon market reporting and climate policy analysis. Their commitment to delivering in-depth news and intelligence on global carbon pricing initiatives has established a strong track record in informing market development and global policy-making through a wide array of resources.
Prior to conducting sentiment analysis, the text data were preprocessed with the Natural Language Toolkit (NLTK) to eliminate redundant information: punctuation, numbers, excessive whitespace, and English stopwords were removed, and the text was converted to lowercase. The sentiment scores of the news texts were computed using VADER (Valence Aware Dictionary and Sentiment Reasoner), a sentiment analysis tool that evaluates text sentiment by referencing a lexicon of words assigned sentiment scores and employing straightforward rules. After calculating the positive and negative sentiment scores for each article, we resampled the data on a daily basis to ensure quality for predictive purposes.
Figure 5 illustrates the daily variations in positive and negative sentiment intensity over time for Carbon Pulse news.
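The cleaning and daily resampling steps can be sketched with the standard library alone. Two assumptions are made here: a tiny hard-coded stopword set stands in for NLTK's English stopword list, and articles published on the same day are aggregated by their mean score, since the aggregation rule is not stated. The sentiment scores themselves are taken as given inputs.

```python
import re
import string
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Tiny illustrative stopword set; a stand-in for NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}


def clean_text(text: str) -> str:
    """Lowercase, then drop numbers, ASCII punctuation, extra whitespace, and stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # str.split() with no argument also collapses repeated whitespace.
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


def daily_mean_scores(
    scored: Iterable[Tuple[str, float, float]]
) -> Dict[str, Tuple[float, float]]:
    """Average (positive, negative) scores over all articles published on the same day.

    Each input item is (ISO date, positive score, negative score).
    """
    buckets: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for day, pos, neg in scored:
        buckets[day].append((pos, neg))
    return {
        day: (
            sum(p for p, _ in items) / len(items),
            sum(n for _, n in items) / len(items),
        )
        for day, items in sorted(buckets.items())
    }


cleaned = clean_text("The EU ETS price rose 5% in 2023!")
daily = daily_mean_scores(
    [("2021-01-04", 0.2, 0.1), ("2021-01-04", 0.4, 0.3), ("2021-01-05", 0.0, 0.5)]
)
```

Days without any article do not appear in the result and would need an explicit fill rule before being joined to the trading calendar.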
In addition to news articles, investors often leverage various search engines to explore topics of interest, and search indices reflect public attention levels toward specific subjects [
64]. Google Trends provides real-time insights into search behavior, enabling researchers to measure interest in specific subjects across locations and timeframes. Queries can be made either for individual search terms or for topics; topics are generally considered more reliable because they cover exact phrases, spelling variations, and acronyms across multiple languages. For this study, the topic “European Union Emissions Trading System” was selected as the search index, covering the period from 4 January 2021 to 23 February 2024, with data collected daily. The values of the search index are normalized on a scale from 0 to 100 rather than representing absolute search volumes.
Figure 5 also depicts daily changes in attention data regarding the EU ETS over time.
2.4. Forecasting Models and Parameters Setting
In this study, we chose the Temporal Fusion Transformer to implement multi-task learning for the interval-valued EUA futures prices forecasting because its architectural design inherently supports this approach. The TFT model [
76] employs hard parameter sharing within the encoder phase to extract shared representations from diverse data sources, while task-specific heads are utilized for predicting the upper and lower bounds of the interval-valued carbon prices. Moreover, the model’s interpretability enables the demonstration of the intricate details of the prediction process, fostering greater user trust in complex deep learning-based predictive systems. We also compared a TFT variant configured for single-task learning to ensure a consistent baseline for evaluating the relative benefits of the proposed multi-task approach. To provide a comprehensive benchmarking landscape, we also employ Transformer [
78] and TCN [
79] models under the multi-task learning paradigm, ensuring that all three multi-task frameworks are evaluated on an equal footing. For comparison, single-task variants of these architectures (Transformer and TCN) serve as consistent baselines to isolate the benefits introduced by multi-task learning. Furthermore, four well-established deep learning models—LSTM [
80], DeepAR [
81], DecoderMLP [
82], and GRU [
83]—were utilized as single-task learning benchmark models to evaluate the predictive performance of the proposed framework. The models were selected based on their demonstrated efficacy in time series forecasting tasks and their extensive application in the literature. Furthermore, when selecting benchmark models to compare forecasting ability, extensive literature has confirmed the nonlinearity and non-stationarity of the allowance prices [
58,
60], limiting the applicability of traditional econometric models. Deep learning models generally exhibit superior predictive performance. Therefore, we only selected deep learning models for comparison. The decision not to include specific formulas for the deep learning models utilized in this study stems from several considerations. Firstly, these models are well-established in the literature, and their underlying architectures and mathematical formulations are widely available in numerous academic sources, making detailed re-exposition unnecessary. Secondly, the focus of this manuscript is on the comparative analysis of the models’ predictive performance within the proposed framework rather than on the derivation of their mathematical foundations. By omitting the formulas, we aim to streamline the discussion and concentrate on the practical application and results of the models in the context of our study.
Experiment Setup
A sliding time window of size 5 was used for all forecasting models; this lag was determined to be optimal based on the Akaike Information Criterion (AIC). Next, we provide a detailed description of the fundamental parameters employed during model training. The data sequence designated for prediction is divided chronologically into training, validation, and test sets, comprising 80%, 10%, and 10% of the total sequence length, respectively. The training set is utilized for initial model learning, while the validation set is used to fine-tune and optimize hyperparameters to enhance training performance. The test set is used only to evaluate the performance of the final model. The computing hardware used in this study included an NVIDIA (Santa Clara, CA, USA) GeForce RTX 3090 GPU and an Intel (Santa Clara, CA, USA) Xeon(R) Platinum 8362 CPU. The deep learning framework utilized was PyTorch version 2.0.1. Hyperparameters were automatically optimized using Optuna (
https://optuna.org/). A summary of the parameters used for training the forecasting models is presented in
Table 2.
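The windowing and splitting logic described above can be sketched as follows. The sketch assumes the split is chronological and applied to the raw sequence before windows are built; the function names and the synthetic series are illustrative.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], float]


def chronological_split(
    series: Sequence[float], train_frac: float = 0.8, val_frac: float = 0.1
) -> Tuple[Sequence[float], Sequence[float], Sequence[float]]:
    """Split a sequence into train/validation/test parts without shuffling."""
    n = len(series)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return series[:train_end], series[train_end:val_end], series[val_end:]


def make_windows(series: Sequence[float], window: int = 5, horizon: int = 1) -> List[Sample]:
    """Pair each block of `window` consecutive values with the value `horizon` steps later."""
    samples: List[Sample] = []
    for start in range(len(series) - window - horizon + 1):
        inputs = list(series[start : start + window])
        target = series[start + window + horizon - 1]
        samples.append((inputs, target))
    return samples


# Synthetic sequence of 100 observations, for illustration only.
data = [float(i) for i in range(100)]
train, val, test = chronological_split(data)
train_samples = make_windows(train, window=5, horizon=1)
```

Setting `horizon` to 3 or 5 yields targets for the longer forecasting horizons while keeping the same five-day input window.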
2.5. Evaluation Criteria
To rigorously assess the forecasting performance of the proposed multi-task learning framework and benchmark models, we employ six widely recognized metrics tailored specifically for interval-valued predictions [
43,
84].
First, we utilize the interval U of Theil statistics ($U^I$) and the interval average relative variance ($ARV^I$), defined as follows:
$$U^I = \sqrt{\frac{\sum_{t=1}^{m}\left[(X_t^U-\hat{X}_t^U)^2+(X_t^L-\hat{X}_t^L)^2\right]}{\sum_{t=1}^{m}\left[(X_t^U-X_{t-1}^U)^2+(X_t^L-X_{t-1}^L)^2\right]}},$$
$$ARV^I = \frac{\sum_{t=1}^{m}\left[(X_t^U-\hat{X}_t^U)^2+(X_t^L-\hat{X}_t^L)^2\right]}{\sum_{t=1}^{m}\left[(X_t^U-\bar{X}^U)^2+(X_t^L-\bar{X}^L)^2\right]}.$$
Here, $m$ represents the number of intervals in the test set, $\hat{X}_t = [\hat{X}_t^L, \hat{X}_t^U]$ indicates the forecasted interval at time $t$, and $\bar{X} = [\bar{X}^L, \bar{X}^U]$ denotes the sample mean of the interval, where $\bar{X}^U$ is the mean of the upper bounds and $\bar{X}^L$ is the mean of the lower bounds.
Both $U^I$ and $ARV^I$ are commonly used to compare forecasting errors between the reference model and a naïve model [85]. Specifically, $U^I$ is utilized to evaluate the forecasting errors of the reference model against those of the random walk model. A value of $U^I > 1$ indicates that the reference model performs worse than the random walk model, $U^I = 1$ suggests equivalent performance, and $U^I < 1$ indicates that the reference model outperforms the random walk model. As $U^I$ approaches zero, the reference model’s performance approaches perfection. Similarly, $ARV^I$ compares the reference model’s errors to those obtained by forecasting the average of the series, with lower values indicating better forecasts. Notably, $ARV^I = 0$ denotes a perfect reference model, while $ARV^I = 1$ implies that the model performs similarly to the series average. Importantly, these metrics account for both upper and lower bound forecasting errors simultaneously and are scale-invariant with respect to the time series.
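A minimal sketch of the two statistics is given below. It assumes the usual formulation: squared errors on both bounds are summed, then normalized by the errors of a one-step random walk (for the U of Theil statistic) or by the deviations from the sample mean interval (for the average relative variance). The function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def _squared_error(a: Interval, b: Interval) -> float:
    """Squared error summed over the lower and upper bounds."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def interval_theil_u(
    actual: Sequence[Interval], forecast: Sequence[Interval], previous: Sequence[Interval]
) -> float:
    """Interval U of Theil; previous[t] is the observed interval one step before actual[t]."""
    model_err = sum(_squared_error(a, f) for a, f in zip(actual, forecast))
    random_walk_err = sum(_squared_error(a, p) for a, p in zip(actual, previous))
    return math.sqrt(model_err / random_walk_err)


def interval_arv(actual: Sequence[Interval], forecast: Sequence[Interval]) -> float:
    """Interval average relative variance, relative to the sample mean interval."""
    m = len(actual)
    mean_interval = (sum(a[0] for a in actual) / m, sum(a[1] for a in actual) / m)
    model_err = sum(_squared_error(a, f) for a, f in zip(actual, forecast))
    mean_err = sum(_squared_error(a, mean_interval) for a in actual)
    return model_err / mean_err


# Toy intervals (illustrative values only).
actual = [(1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]
previous = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0)]
u_random_walk = interval_theil_u(actual, previous, previous)
```

Using the random walk itself as the forecast returns exactly 1, which is the break-even value against which a model is judged.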
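Second, we employ the mean squared error of interval ($MSE^I$) [86] alongside two distance measures: the mean distance error based on the Ichino–Yaguchi distance ($MDE_{IY}$) and the Hausdorff distance ($MDE_{H}$) [87], defined as follows:
$$MSE^I = \frac{1}{m}\sum_{j=1}^{m}\left[(c_j-\hat{c}_j)^2+(r_j-\hat{r}_j)^2\right],$$
$$MDE_{IY} = \frac{1}{m}\sum_{j=1}^{m}\frac{1}{2}\left(\left|X_j^L-\hat{X}_j^L\right|+\left|X_j^U-\hat{X}_j^U\right|\right),$$
$$MDE_{H} = \frac{1}{m}\sum_{j=1}^{m}\left(\left|c_j-\hat{c}_j\right|+\left|r_j-\hat{r}_j\right|\right),$$
where $c_j = (X_j^L + X_j^U)/2$ and $r_j = (X_j^U - X_j^L)/2$ represent the center and radius of the $j$-th interval, while $\hat{c}_j$ and $\hat{r}_j$ denote the center and radius of the forecasted interval [88]. Consequently, $c_j-\hat{c}_j$ and $r_j-\hat{r}_j$ represent the positional and length errors between the actual and forecasted intervals, respectively. Thus, $MSE^I$ captures both positional and length errors, with lower values indicating superior forecasts. $MDE_{IY}$ quantifies the deviation of the interval’s minimum and maximum, while $MDE_{H}$ assesses the deviation of the center and radius.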
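The three error measures can be sketched as below, under stated assumptions: the interval mean squared error sums the squared center and radius errors, the Ichino–Yaguchi distance is taken as half the sum of the absolute bound errors, and the Hausdorff distance as the larger of the two absolute bound errors. These are common formulations in the interval forecasting literature and may differ in normalization from the ones used in the study; the function names are illustrative.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def center_radius(interval: Interval) -> Tuple[float, float]:
    """Return (center, radius) of an interval."""
    lower, upper = interval
    return (lower + upper) / 2.0, (upper - lower) / 2.0


def interval_mse(actual: Sequence[Interval], forecast: Sequence[Interval]) -> float:
    """Mean of squared positional (center) plus squared length (radius) errors."""
    total = 0.0
    for a, f in zip(actual, forecast):
        (ca, ra), (cf, rf) = center_radius(a), center_radius(f)
        total += (ca - cf) ** 2 + (ra - rf) ** 2
    return total / len(actual)


def mde_ichino_yaguchi(actual: Sequence[Interval], forecast: Sequence[Interval]) -> float:
    """Mean distance error using half the sum of absolute bound errors."""
    total = sum(0.5 * (abs(a[0] - f[0]) + abs(a[1] - f[1])) for a, f in zip(actual, forecast))
    return total / len(actual)


def mde_hausdorff(actual: Sequence[Interval], forecast: Sequence[Interval]) -> float:
    """Mean distance error using the larger of the two absolute bound errors."""
    total = sum(max(abs(a[0] - f[0]), abs(a[1] - f[1])) for a, f in zip(actual, forecast))
    return total / len(actual)


# Toy example: actual interval [1, 3], forecast interval [2, 6].
obs, pred = [(1.0, 3.0)], [(2.0, 6.0)]
mse_value = interval_mse(obs, pred)
```

For closed intervals on the real line, the larger absolute bound error equals the absolute center error plus the absolute radius error, which is why the Hausdorff-based measure can be read as a center and radius deviation.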
Third, to ensure practical applicability in the financial domain, a robust prediction model must exhibit not only strong fitting performance but also superior predictive capability. Therefore, we incorporate interval directional statistics (
) to evaluate the model’s predictability, defined as follows:
A larger value of the interval directional statistic signifies a higher predictive ability of the model.
3. Results and Discussion
In this section, we begin by comparing the forecasting results of various models, followed by an ablation study to assess the contribution of integrated big data. We then proceed to analyze the model’s interpretability, and conclude with a robustness analysis of the prediction results.
3.1. Performance Evaluation
The performance of various forecasting models for the interval-valued EUA prices across three forecasting horizons—1 day-ahead, 3 days-ahead, and 5 days-ahead—is summarized in
Table 3. The results highlight the effectiveness of the proposed multi-task learning framework (TFT*) in comparison to multi-task and single-task benchmark models.
Overall, the TFT* model demonstrates consistent superiority across all forecasting horizons, achieving lower error metrics and maintaining interval directional accuracy. Notably, the interval U of Theil values for TFT* remain below 1 across all horizons, indicating its ability to outperform the random walk model. In contrast, the multi-task (Transformer* and TCN*) and single-task benchmarks (Transformer, TCN, LSTM, GRU, DeepAR, and DecoderMLP) show declining performance as the forecasting horizon extends, with interval U of Theil values exceeding 1 at longer horizons, signaling their limited ability to beat the random walk model.
Focusing on the 1 day-ahead forecasts, the TFT* model outperforms all other models across most evaluation metrics, indicating its effectiveness in capturing immediate price dynamics. Transformer* and TCN* also exhibit solid performance, although their error values are slightly higher and their directional accuracy somewhat lower. Among the single-task variants, TFT and LSTM achieve competitive values, but their higher error measures indicate challenges in maintaining precise interval estimates. As the forecasting horizon extends to 3 days-ahead, the TFT* model maintains robust performance, continuing to outperform the random walk model and demonstrating its adaptability over slightly longer periods. The other benchmark models, however, show increased error rates, reflecting the difficulty of maintaining prediction precision over multiple days. At the 5 days-ahead horizon, the performance gap widens further. The TFT* model remains the top performer, maintaining balanced interval-valued accuracy and directional consistency, while none of the benchmarks surpasses the random walk model.
A more intuitive assessment of model performance is presented in
Figure 6, which offers a 3D visual comparison of the predictive models across three forecasting horizons. As depicted, prediction accuracy generally declines as the forecasting horizon lengthens, consistent with typical patterns observed in time series forecasting.
We further conducted a relative percentage improvement analysis to compare the TFT* model with the benchmark models within 1 day-ahead forecasting since it is the most crucial horizon for real-time decision-making in the carbon market. As shown in
Figure 7, the proposed multi-task learning model demonstrates a consistent advantage in the accuracy of interval-valued predictions, such as showing a relative improvement of approximately 40% compared to DeepAR and DecoderMLP, particularly in
and
metrics. While the enhancement is somewhat smaller when compared to the multi-task learning Transformer* and TCN* models, it remains evident. In terms of
, the TFT* model shows a relative improvement of 3.59% over multi-task and single-task Transformer, 5.89% over multi-task and single-task TCN, 9.30% over GRU, 11.63% over DeepAR, and 53.48% over DecoderMLP. Although its performance is comparable to that of single-task learning TFT and LSTM, TFT* achieves greater enhancements across other quantitative prediction standards.
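The relative percentage improvement can be illustrated with the short sketch below. The exact formula is not stated in the text, so the sketch assumes the common definition: the gain over the benchmark divided by the benchmark value, with the sign flipped for metrics where higher is better. The function name and the sample values are hypothetical.

```python
def relative_improvement(
    benchmark: float, proposed: float, higher_is_better: bool = False
) -> float:
    """Percentage improvement of the proposed model over a benchmark (assumed definition).

    For error metrics (lower is better) the gain is benchmark - proposed;
    for accuracy-type metrics (higher is better) it is proposed - benchmark.
    """
    if benchmark == 0:
        raise ValueError("benchmark value must be non-zero")
    gain = proposed - benchmark if higher_is_better else benchmark - proposed
    return 100.0 * gain / abs(benchmark)


# Hypothetical values: an error metric falling from 0.5 to 0.3,
# and a directional accuracy rising from 50 to 60.
error_gain = relative_improvement(0.5, 0.3)
accuracy_gain = relative_improvement(50.0, 60.0, higher_is_better=True)
```

A positive result always means the proposed model is better, whichever direction the underlying metric runs.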
3.2. Ablation Analysis
To evaluate the efficacy of incorporating the diverse data streams from multiple sources into the interval-valued EUA prices prediction, we conducted a series of ablation experiments. As detailed in
Table 4, the experiments systematically assess the impact of integrating various data categories into both single-task and multi-task learning frameworks to enhance predictive performance.
In the single-task learning scenario, TFT represents the configuration that integrates all the diverse-source variables, including carbon market internal variables, related market futures prices, news sentiment intensity, and search index data. TFT (Category 1) serves as the baseline single-task model, incorporating only the carbon market internal variables for forecasting. In the multi-task learning framework, TFT* models are categorized based on the data sources they utilize. TFT* (Category 1) is the baseline multi-task model that incorporates only the carbon market internal variables. Building on this baseline, TFT* (Category 2) adds the related market futures price variables, expanding the input scope beyond the carbon market data. TFT* (Category 3) further incorporates the search index data into the configuration of TFT* (Category 2), reflecting the influence of public interest on the market dynamics. TFT* (Category 4) builds on TFT* (Category 2) by incorporating the news sentiment intensity, providing a complementary layer of information on the market sentiment. Finally, TFT* integrates all the diverse-source variables, representing the complete configuration of the proposed multi-task learning framework.
The results in
Table 4 demonstrate the incremental benefits of integrating additional data sources for both single-task and multi-task learning frameworks. In the single-task scenario, TFT (Category 1), which incorporates only the carbon market internal variables, exhibits the lowest predictive accuracy among all configurations. The inclusion of additional data sources in TFT improves performance, but it consistently lags behind its multi-task counterparts across all evaluation metrics. For the multi-task learning framework, the baseline model, TFT* (Category 1), demonstrates the lowest predictive accuracy among all configurations. Expanding the input scope by incorporating additional data sources greatly improves performance. For instance, TFT* (Category 2), which includes the related market futures prices, achieves a reduction in
by 7.93% and
by 8.10%, alongside an 8.34% improvement in interval directional accuracy (
) compared to the baseline. Further enhancement is observed in TFT* (Category 3), where the addition of the search index data to TFT* (Category 2) results in a further reduction in
by 8.16% and
by 10.41%, while maintaining the same 8.34% improvement in
. Similarly, TFT* (Category 4), which integrates the news sentiment intensity instead of the search index data into TFT* (Category 2), demonstrates a marked improvement, with
increasing by 21.43% compared to the baseline, highlighting the substantial influence of sentiment data on prediction performance. This effect has also been reported in other studies [
62,
63,
66], which have demonstrated that incorporating indices derived from online news can substantially enhance the accuracy of models in forecasting allowance prices within the EU ETS. These results underscore the importance of combining the carbon market internal variables with the related market futures prices, search index data, and news sentiment intensity for robust interval-valued carbon price forecasting in the EU ETS.
3.3. Interpretability Use Cases
After confirming the performance advantages of the model, we used the TFT model under multi-task learning to demonstrate two interpretability use cases: visualizing the temporal patterns across the time indices used in the model encoder, and assessing the importance of each input variable in the prediction.
3.3.1. Visualizing Temporal Patterns
The interpretability of the TFT model provided insights into the importance of different time indices within the sliding window for interval-valued allowance price forecasting under multi-task learning.
Figure 8 illustrates the attention weight patterns assigned to time indices during one-step predictions on the test dataset, highlighting the temporal focus for forecasting high and low prices of the carbon emission allowances.
For high price predictions, the multi-task learning framework consistently allocated substantial attention to a specific trading day within certain weekly forecasting periods. The pattern indicates that the model identifies this day as a highly influential time index for predicting subsequent price peaks. Notably, this attention is persistently directed toward the same time index across multiple weeks. In contrast, for low price predictions, the model assigned higher attention weights to the three most recent trading days within the sliding window, suggesting that these time indices were critical for capturing the factors influencing minimum price predictions, with relatively lower attention allocated to earlier time steps.
3.3.2. Analyzing Variable Importance
The importance of each input variable in the multi-task learning framework of the TFT model was quantified by analyzing the selection weights obtained during predictions. These weights were aggregated across the entire test set to create an importance distribution for each variable, revealing the key inputs driving the forecasting process.
Figure 9 provides a heatmap depicting the relative importance of all input variables over different time steps, along with a summary bar chart on the right showing the overall contribution of each variable. Each heatmap block along the horizontal axis represents the features that the model emphasizes when predicting the interval carbon prices at a specific time point, quantitatively showing which feature segments contribute more to the prediction. For instance, as the test set progresses, if the attention weights of multiple features in the matrix simultaneously increase, it may indicate strong interactive effects among them during that period.
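The aggregation step described above can be illustrated with a minimal sketch. It assumes that each test sample yields one selection weight per input variable and that the weights are averaged over samples and renormalized; the function name `aggregate_importance`, the variable names, and the numerical values are hypothetical and are not taken from the experiments reported here.

```python
from typing import Dict, List


def aggregate_importance(weights: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum per-sample selection weights and renormalize so they total 1."""
    if not weights:
        raise ValueError("at least one sample is required")
    names = list(weights[0])
    totals = {name: 0.0 for name in names}
    for sample in weights:
        for name in names:
            totals[name] += sample[name]
    grand_total = sum(totals.values())
    return {name: totals[name] / grand_total for name in names}


# Hypothetical selection weights for three test samples (illustrative only).
samples = [
    {"close": 0.6, "gas": 0.3, "news": 0.1},
    {"close": 0.5, "gas": 0.3, "news": 0.2},
    {"close": 0.7, "gas": 0.2, "news": 0.1},
]
importance = aggregate_importance(samples)
# Variables ordered from most to least important.
ranking = sorted(importance, key=importance.get, reverse=True)
```

Under these assumptions, the per-sample rows correspond to the columns of the heatmap, and `importance` plays the role of the summary bar chart.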
The analysis highlighted that the close price of carbon allowances was the most influential variable, accounting for over half of the total importance. The finding underscores the central role of the close price in the forecasting process, as it serves as a comprehensive indicator of market sentiment and daily performance, critical for generating accurate predictions. The second most important set of variables includes energy market factors, such as natural gas, coal, and crude oil prices. The energy inputs strongly affect the model’s predictions, reflecting the intrinsic link between energy prices and carbon costs, as fossil fuel combustion is a major source of emissions. The importance of energy prices in forecasting EUA futures prices has similarly been highlighted in interpretability analyses of close price prediction [
7,
89]. While the financial market variable (STOXX Europe 600 index) shows a secondary influence, it still provides contextual information that indirectly affects the allowance price forecasts. Financial market conditions affect investment flows and liquidity within the carbon market, making it crucial for risk managers to remain aware of these influences during periods of economic volatility. Salvagnin et al. [
77] pointed out that during the transition of the EU ETS to Phase IV, the influence of financial market volatility appeared to take a central role. Regarding the EUA futures open-high-low-close price data and trading volume, the analysis reveals that the open price and trading volume are more influential in the forecasting process than the high and low prices. This suggests that early trading signals and overall market activity provide more reliable information for predicting price movements. Lastly, the online news sentiment intensity and the search index contribute to the prediction, albeit with slightly less importance than market-related data. Positive news sentiment exerts a stronger influence on predictions than negative sentiment, reflecting the market’s tendency to respond more strongly to optimistic developments.
3.4. Robustness Analysis
3.4.1. Forecasting Robustness Under Different Market Conditions
To evaluate the reliability of our proposed forecasting framework, we conducted an extensive analysis across diverse market conditions during the test period, focusing on interval directional accuracy (
). Specifically, we used average-based thresholds to separate (i) high and low volatility, and (ii) high and low global (GEPU) and European (EEPU) policy uncertainty, reflecting the crucial influence of policy fluctuations, macroeconomic disruptions, and political events on EU ETS price behavior. In addition, we examined two further market characteristics: (iii) market liquidity, defined by average trading volumes, and (iv) energy price levels, proxied by the mean NBP natural gas futures prices (natural gas plays a significant role in Europe’s energy consumption and power generation structure and is closely linked to the carbon market [
3,
6,
18]). Such an extension allowed a more comprehensive view of how varying liquidity and energy costs might affect carbon price predictability.
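The average-based regime split can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiments: the function name `accuracy_by_regime` is hypothetical, `hits` is assumed to flag whether each forecast was directionally correct (the interval directional accuracy criterion itself is not reproduced here), and days exactly equal to the mean are assigned to the low regime.

```python
from statistics import mean
from typing import Dict, List


def accuracy_by_regime(indicator: List[float], hits: List[bool]) -> Dict[str, float]:
    """Split test days at the mean of `indicator` and report the hit rate per regime."""
    if not indicator or len(indicator) != len(hits):
        raise ValueError("indicator and hits must be non-empty and equally long")
    threshold = mean(indicator)
    groups: Dict[str, List[bool]] = {"high": [], "low": []}
    for value, hit in zip(indicator, hits):
        # Assumption: strictly above the mean counts as the high regime.
        groups["high" if value > threshold else "low"].append(hit)
    return {name: sum(flags) / len(flags) for name, flags in groups.items() if flags}


# Hypothetical daily indicator (e.g., trading volume) and directional hit flags.
volume = [10, 40, 20, 50, 30, 60]
hits = [True, True, True, False, True, True]
result = accuracy_by_regime(volume, hits)
```

The same function applies to volatility, GEPU, EEPU, or natural gas prices by passing the corresponding series as `indicator`.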
As summarized in
Table 5, the proposed multi-task learning framework (TFT*) consistently outperformed other models in high uncertainty, high volatility, high liquidity, and both high and low energy price scenarios, demonstrating its ability to effectively capture market dynamics where price movements were more volatile and harder to predict. Even under low-volatility and low-uncertainty scenarios, where market dynamics were more stable, TFT* maintained competitive accuracy. While single-task benchmarks such as LSTM and GRU achieved comparable or slightly higher scores in some low-volatility or low-uncertainty environments, their performance degraded substantially in high-volatility or high-uncertainty scenarios. The contrast demonstrates the limitations of single-task learning approaches in handling complex market conditions and highlights the advantage of the multi-task learning framework in ensuring robust and reliable predictions across diverse market environments.
3.4.2. Superior Predictive Ability Test
The Superior Predictive Ability (SPA) test [
90] conducted in this study served as a rigorous evaluation of the proposed multi-task learning framework compared to other forecasting algorithms. The statistical values shown in
Table 6 are the
p-values resulting from the SPA test, which indicate whether the forecasting accuracy of the proposed model is significantly greater than that of its comparators. When the
p-value falls below 0.05, it denotes statistical significance, suggesting that the test model delivers superior predictive performance relative to the benchmark. Furthermore, our analysis was confined to one-day-ahead forecasting outcomes, a setting widely adopted and practically essential in carbon price prediction. As shown in Panel A, the multi-task TFT framework (TFT*) consistently outperforms benchmark models for high price predictions, with statistical significance in most comparisons. Similarly, Panel B for low price predictions further validates its robustness.
4. Conclusions
In this paper, we have proposed a novel multi-task learning-based framework for the prediction of interval-valued EU ETS carbon allowance futures prices. We utilized a Transformer-based model (Temporal Fusion Transformer) to implement multi-task learning and forecast interval-valued carbon prices by integrating diverse-source data, including online news, Google search trends, and market-related futures prices. Our findings demonstrated that the proposed multi-task learning framework consistently outperformed all benchmark models in predictive accuracy and exhibited robustness under conditions of high market volatility or economic policy uncertainty. Ablation experiments revealed that incorporating either online news sentiment or search trend data individually improved the model’s predictive performance. When both news sentiment intensity and search index data were integrated, the model achieved the highest level of predictive accuracy, indicating that the combined use of diverse big data sources effectively captured the complex dynamics of the carbon market. The interpretability analysis offered deeper insights into the factors influencing carbon prices. In this study, the model’s attention mechanisms for the test period indicated that a specific trading day within certain weekly periods considerably influenced high price predictions, highlighting its importance for identifying price peaks. For low price predictions, the model allocated substantial attention to the three trading days preceding the prediction date, underscoring their critical role in determining minimum price forecasts. Moreover, variable importance analysis confirmed that the carbon allowance close price was the most influential factor, followed by energy market variables such as natural gas, coal, and crude oil prices. Online news sentiment and search index data also contributed meaningfully to the forecasting process, with positive news sentiment exerting a stronger influence than negative sentiment.
The findings carry important policy implications and practical significance for stakeholders in the carbon market and for advancing environmental management. Policymakers and regulators could leverage the model’s insights to better understand how market sentiment and energy prices affect the EU ETS, enabling more informed policy interventions. Investors and companies could monitor the key factors identified during the forecasting process to optimize compliance strategies, low-carbon technology investments, and risk management against price volatility. The interpretability of the proposed framework facilitates user trust in complex deep learning-based predictive models, providing greater transparency in decision-making processes.
Future research could build on these findings by exploring the applicability of the proposed framework to other carbon markets or broader financial markets where interval-valued data plays a key role. Furthermore, the integration of multimodal data related to the carbon market (e.g., images, videos, audio) for allowance price prediction is expected to become a critical application direction driven by advancements in deep learning technologies.