Article

Consensus and Divergence in Explainable AI (XAI): Evaluating Global Feature-Ranking Consistency with Empirical Evidence from Solar Energy Forecasting

School of Computer Science and Informatics, De Montfort University, Leicester LE1 9BH, UK
*
Author to whom correspondence should be addressed.
Mathematics 2026, 14(2), 297; https://doi.org/10.3390/math14020297
Submission received: 11 November 2025 / Revised: 4 January 2026 / Accepted: 9 January 2026 / Published: 14 January 2026

Abstract

The growing reliance on solar energy necessitates robust and interpretable forecasting models for stable grid management. Current research frequently employs Explainable AI (XAI) to glean insights from complex black-box models, yet the reliability and consistency of these explanations remain largely unvalidated. Inconsistent feature attributions can mislead grid operators by incorrectly identifying the dominant drivers of solar generation, thereby affecting operational planning, reserve allocation, and trust in AI-assisted decision-making. This study addresses this critical gap by conducting a systematic statistical evaluation of feature rankings generated by multiple XAI methods, including model-agnostic (SHAP, PDP, PFI, ALE) and model-specific (Split- and Gain-based) techniques, within a time-series regression context. Using a LightGBM model for one-day-ahead solar power forecasting across four sites in Calgary, Canada, we evaluate consensus and divergence using the Friedman test, Kendall’s W, and Spearman’s rank correlation. To ensure the generalizability of our findings, we further validate the results using a CatBoost model. Our results show a strong overall agreement across methods (Kendall’s W: 0.90–0.94), with no statistically significant difference in ranking (p > 0.05). However, pairwise analysis reveals that the “Split” method frequently diverges from other techniques, exhibiting lower correlation scores. These findings suggest that while XAI consensus is high, relying on a single method—particularly the split count—poses risks. We recommend employing multi-method XAI and using agreement as an explicit diagnostic to ensure transparent and reliable solar energy predictions.

1. Introduction

The increasing occurrence of extreme weather and natural disasters in the 21st century has escalated demand for a transition toward green, low-carbon energy solutions. A central part of this transition is the widespread implementation of renewable energy sources, valued for their clean and abundant nature; these include wind power, solar energy, and hydroelectric power. Solar energy has emerged as a key resource for reducing fossil fuel dependence, with widespread application in industrial, commercial, and residential sectors.
However, changes in solar energy output at different times, in different seasons, and under different weather conditions cause uncertainty in energy production and create challenges for managing the power grid. Accurate forecasting of solar power becomes essential for handling these challenges [1]. While traditional statistical models use historical data to forecast solar power, they often struggle to capture the data’s inherent nonlinearity and temporal complexity. Black-box machine learning models have gained recognition for their predictive performance and ability to handle nonlinear patterns [2]. Despite their effectiveness, these models often lack transparency and interpretability, making it hard to fully rely on them for decision-making. Explainable AI (XAI) techniques are often utilized along with black-box models to understand the behaviour of the model and gain insights into factors that influence energy production, aiding in optimizing, planning, and controlling systems [3,4].
Accurate forecasting of solar energy production is crucial for ensuring an efficient and stable energy system, as the integration of renewable energy sources into electricity grids continues to grow. Gradient boosting tree models like LightGBM [5] exhibit strong performance in capturing the nonlinear and dynamic nature of solar energy time-series data. Even though LightGBM already incorporates built-in feature selection, it is often seen as an opaque or black box when it comes to interpretability, limiting the ability to understand the reasons behind its decisions.
To address this interpretability issue, different XAI techniques have been employed alongside the LightGBM model to enhance the transparency of the model’s decision-making. Some interpretation methods are model-specific, relying on the built-in mechanisms of certain machine learning models, while others are model-agnostic, which can be applied externally to any black-box model. While XAI methods have proven impactful, most of the existing research relies solely on one or two model-agnostic methods or on model-specific approaches, without evaluating the consistency of insights across different frameworks. This gap exists as different XAI methods emerged from varying theoretical principles and algorithmic structures, which can lead to conflicting interpretations of the same model. As a result, this leaves a notable gap in knowledge when it comes to choosing a trustworthy and computationally feasible XAI.
To assess both the consensus and divergence among these methods, a comparative analysis of multiple XAI techniques, including both model-agnostic and model-specific (built-in mechanism) approaches, followed by statistical analysis, is essential. The statistical significance of differences in their explanations also remains unexplored. Hence, this research seeks to conduct a comparative analysis of several XAI methods by statistically evaluating their outputs and ultimately improving the transparency and trustworthiness of solar energy forecasting models.

2. Related Works

2.1. Explainable AI

Explainable AI has become important for enhancing the transparency and interpretability of machine learning models, especially in areas such as healthcare, finance, energy, and education. While the implementation of XAI methods is becoming increasingly accessible, it is also important to understand the theoretical foundation of each technique to ensure its appropriate application and interpretation. XAI techniques are generally divided into two categories: model-agnostic and model-specific.
Model-agnostic methods operate as external tools and can be integrated with any machine learning model, regardless of its internal structure. These methods provide insights into the model’s decision-making process without needing to assess its underlying mechanism.
Shapley Additive Explanations (SHAP) [6] is a model-agnostic method. It is an interpretation framework based on Shapley values from cooperative game theory, where features are treated as players in a game, and the prediction represents the outcome or score of the game. SHAP ranks the features fairly by averaging the marginal contribution effect of each player (feature) across all possible combinations to determine its impact on the model’s prediction [7].
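As a toy illustration of this averaging (not the optimized TreeSHAP algorithm SHAP uses in practice), the exact Shapley value of each feature can be computed by enumerating feature orderings over a hypothetical two-feature value function:

```python
from itertools import permutations

# Toy value function v(S): the "prediction" when only the features in
# coalition S are known. The values (3.0, 2.0, 1.0 interaction) are made up
# purely for demonstration.
def v(coalition):
    s = frozenset(coalition)
    value = 0.0
    if "temp" in s:
        value += 3.0
    if "hour" in s:
        value += 2.0
    if {"temp", "hour"} <= s:
        value += 1.0  # interaction term
    return value

def shapley(feature, features):
    # Average the marginal contribution of `feature` over all orderings
    total, count = 0.0, 0
    for order in permutations(features):
        idx = order.index(feature)
        before = order[:idx]
        total += v(before + (feature,)) - v(before)
        count += 1
    return total / count

features = ("temp", "hour")
phi = {f: shapley(f, features) for f in features}

# Efficiency property: attributions sum to v(all features) - v(empty set)
assert abs(sum(phi.values()) - v(features)) < 1e-9
```

Here the interaction bonus is shared equally between the two features, which is exactly the "fair ranking" property the paragraph above describes.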
Another model-agnostic approach is Permutation Feature Importance (PFI). Breiman [8] first introduced the theoretical foundation of PFI with random forest. The author stated that the drop in the model’s accuracy was measured by randomly shuffling each feature, breaking the relation with the other features and the target variable. The more significant the drop in accuracy, the more important the feature is considered to be.
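A minimal from-scratch sketch of this procedure, using synthetic data and a stand-in predictor (illustrative only, not the study's model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data: y depends strongly on x0, weakly on x1
n = 2000
X = rng.normal(size=(n, 2))
y = 5.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=n)

# A fitted "model": the known linear function stands in for any black box
def predict(X):
    return 5.0 * X[:, 0] + 0.1 * X[:, 1]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

baseline = rmse(y, predict(X))

# Permutation importance: shuffle one column at a time, breaking its relation
# to the target, and record the resulting increase in RMSE
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(rmse(y, predict(Xp)) - baseline)

# The strongly related feature x0 shows the larger drop in accuracy
assert importance[0] > importance[1]
```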
Partial Dependence Plot (PDP), developed by Friedman [9], ranks the features by changing the values of one feature while keeping the others constant. Then, it measures the change in the prediction, capturing the marginal effect of the changed feature. A flat PDP curve represents little or no change in the outcome, while variations in the plot indicate a strong impact on the prediction.
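The marginal effect described above can be sketched directly: for each grid value of the feature of interest, fix that feature at the grid value in every row and average the predictions (the model here is an illustrative stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# Stand-in black-box model: nonlinear in x0, linear in x1
def predict(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1]

# Partial dependence of feature 0: set x0 to each grid value for ALL rows,
# then average the predictions over the data distribution of x1
grid = np.linspace(-2, 2, 9)
pd_curve = []
for g in grid:
    Xg = X.copy()
    Xg[:, 0] = g
    pd_curve.append(predict(Xg).mean())
pd_curve = np.array(pd_curve)
```

A flat `pd_curve` would indicate a feature with little effect; here the curve varies strongly with the grid value, reflecting the sinusoidal dependence on x0.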
Apley and Zhu [10] proposed Accumulated Local Effects (ALE), which calculates the local changes in a prediction within small intervals and then adds up these changes to capture the feature’s overall effect, instead of averaging the marginal effects over the entire feature distribution. The authors then compared the performance of ALE and PDP and argued that ALE outperformed PDP, avoiding extrapolation when the features were strongly correlated to each other.
The basis of model-specific XAI methods lies within the internal mechanism of machine learning models. These approaches use the structural characteristics of the model itself to generate explanations. For tree-based models, such as decision trees or gradient boosting decision trees, model-specific methods are particularly effective due to the inherently structured nature of decision rules and feature splits. These methods can explain not only which features are important but also how they interact and influence predictions.
All XAI methods, whether model-agnostic or model-specific, play an important role in boosting the transparency of black-box machine learning models, supporting both feature selection and interpretation. While the model-agnostic methods provide both global and local insights, the model-specific approaches focus only on global explanation. Despite the growing use of XAI, most studies focus on only one or two methods and lack rigorous comparative analysis across different agnostic and specific frameworks.

2.2. LightGBM

Light Gradient Boosting Machine (LightGBM) [5] has emerged as a powerful and efficient gradient boosting framework widely used across domains for both classification and regression tasks. Its core strengths, such as speed, scalability, support for parallel learning, and capacity to handle high-dimensional, heterogeneous datasets, have made it a popular choice in energy forecasting, healthcare, cybersecurity, and financial analytics.
In the field of time-series forecasting, LightGBM has demonstrated remarkable performance due to its ability to model nonlinear dependencies and interactions among features. Munir et al. [11] showed that a LightGBM-based model achieved the lowest Root Mean Square Error (RMSE) in energy prediction. The model’s ability to capture nonlinear relationships and feature interactions contributes to more accurate and reliable predictions, which are essential for energy planning, cost optimization, and resource allocation in both household and industrial environments.
In the cybersecurity sector, Agrawal et al. [12] proposed a LightGBM-based approach for cyber event detection in power grid systems. Their model integrated concept drift detection and retraining mechanisms, enabling dynamic adaptation to evolving threats. In a Hardware-in-the-Loop (HIL) simulation environment, it achieved over 97% accuracy, showcasing LightGBM’s robustness in real-time, multi-class classification under streaming data conditions.
For financial fraud detection, Zhao, Zhang and Li [13] developed a Bayesian-optimized LightGBM model using data from 2014 to 2023, incorporating features such as executive profiles, ESG scores, and innovation indicators. The model achieved over 95% accuracy, outperforming decision trees, SVM, and random forests.
These studies collectively demonstrate that LightGBM offers a favourable balance between predictive accuracy, computational efficiency, and robustness compared to both traditional gradient-boosted models and deep learning approaches. Unlike deep neural networks, which often require large datasets, extensive hyperparameter tuning, and higher computational cost, LightGBM achieves strong performance with faster training and better interpretability—key advantages for practical solar power forecasting applications. Therefore, these key advantages provide a strong justification for its selection in this study.

2.3. XAI in Solar Energy

While advanced machine learning models have achieved significant improvements in forecasting accuracy, their opaque decision-making processes often make them unsuitable for critical applications without interpretability. This is especially important in the solar energy sector, where understanding the drivers of predictions is essential for planning, safety, and grid management. Explainable Artificial Intelligence (XAI) provides tools and techniques to bridge this gap by making model decisions transparent, traceable, and trustworthy.
Chaibi et al. [14] employed SHAP and PFI to interpret and enhance the predictive performance of a LightGBM model for estimating daily global solar radiation. PFI identified solar radiation and sunshine duration fraction (SF) as the most influential predictors. SHAP showed how radiation and SF positively impacted predictions. Combining SHAP and PFI enabled transparent model interpretation and allowed the selection of optimal features, resulting in a simplified, more accurate LightGBM model.
Similarly, Kuzlu et al. [15] applied LIME, SHAP, and ELI5 to improve the interpretability of tree-based models for solar time-series forecasting. SHAP was used to provide global explanations by quantifying the contribution of each variable across the dataset. LIME offered local, instance-based interpretations of individual predictions. ELI5 supported model transparency through visualizations and feature contribution summaries. By integrating these XAI techniques, the study enhanced model transparency and aimed to highlight the potential of XAI in solar forecasting.
Nallakaruppan et al. [16] employed XAI in enhancing transparency and trust in solar power prediction models. By applying LIME for local explanations and PDP for visualizing feature interactions, the study identified the key factors driving the power output.
Park et al. [17] used LightGBM not only for accurate multistep-ahead solar radiation forecasting but also as a model-specific XAI tool. LightGBM inherently provides feature importance metrics, allowing direct interpretation of which input variables are most influential in making predictions, without needing external explainability methods.
These studies highlighted the essential role of XAI in solar energy forecasting, not only to boost stakeholder trust and improve decision-making but also to support model debugging, feature selection, and model performance degradation monitoring. As solar forecasting models become more complex and data-rich, integrating XAI into the development pipeline is becoming a best practice, particularly in safety-critical applications like grid operations and energy policy.

3. Materials and Methods

The methodology used in this study is divided into several tasks. These tasks are detailed in the following subsections.

3.1. Data Collection

The dataset used for implementing the solar energy prediction comprises hourly energy production records in kilowatt-hours (kWh) from various sites in Calgary, Canada. The original dataset spans from September 2015 until 7 July 2025 (accessed on 10 July 2025). Hourly temperature and humidity data for Calgary for the same period were downloaded from https://calgary.weatherstats.ca/download.html, accessed on 10 July 2025.

3.2. Data Pre-Processing

The dataset included 8 columns, covering 12 different site locations. The attributes were Site name, representing the solar panel location; ID; Address; Date, which is the record of the production data; Energy Production in kWh; Public URL; Installation Date; and UID.
For the analysis, four specific site locations were selected: Southland Leisure Center, Whitehorn Multi-Service Centre, City of Calgary North Corporate Warehouse, and Glenmore Water Treatment Plant. Each site was analyzed and modelled separately. The data between the dates of 1 January 2020 and 30 June 2025 were extracted for analysis.
Sudden and extreme spikes were observed starting in late 2023, around October. These spikes were dramatically higher than earlier records and likely resulted from a unit inconsistency: values recorded in Wh instead of kWh. To correct this, all production records from the affected dates onward were divided by 1000. Anomalies that remained after the unit conversion were treated as missing values and imputed with the chosen method in the same way as other missing values.

Missing Hour Imputation

Hourly timestamps were generated to align with the energy production records and to identify missing hourly data at all sites. Missing hourly timestamps were detected by comparing the expected hourly intervals with the actual production records. Moreover, large gaps of consecutive missing hours were found at all sites, highlighting data quality issues.
The missing hours might be due to sensor malfunction, communication failure, or maintenance. The non-parametric regression imputation of random forest has been used to impute hourly missing timestamps in the data. Random forest [4] is a tree-based imputation method that works well with nonlinear relations and interactions and avoids overfitting. It creates many decision trees, each of which makes a prediction, and then the average prediction is used to impute the values. Moreover, it is robust to outliers and performs well with significant and continuous missing data, which traditional imputation methods are not robust enough to handle [18].
In this study, the random forest model uses the energy output as the target variable, while other features act as predictors. The hour of day, day of year, month, and year features were created for the imputation. Since the hourly energy production followed a bell-shaped curve, with values close to zero (0 or 0.001) at all sites during nighttime hours, imputations for the hours between 10 pm and 4 am were set to 0, and random forest imputation was applied only to the remaining timestamps. While this approach ensures continuity in the dataset, such large imputation gaps may still affect overall model performance. Different thresholds were used at different locations to impute the detected anomalies.
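A minimal sketch of this imputation scheme, using synthetic hourly data and scikit-learn's RandomForestRegressor (the bell-shaped profile and gap location here are illustrative, not the study's exact configuration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic hourly series with a bell-shaped daytime profile (stand-in data)
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="h")
hour = idx.hour.to_numpy()
output = np.clip(np.sin((hour - 4) / 18 * np.pi), 0, None) * 100.0

df = pd.DataFrame({"energy_kwh": output}, index=idx)
df["hour"] = idx.hour
df["day_of_year"] = idx.dayofyear
df["month"] = idx.month
df["year"] = idx.year

# Knock out a block of hours to mimic a gap of consecutive missing timestamps
missing = df.index[500:560]
df.loc[missing, "energy_kwh"] = np.nan

# Night hours (10 pm-4 am) are set to zero directly, as in the text
night = df["hour"].isin([22, 23, 0, 1, 2, 3, 4])
df.loc[df["energy_kwh"].isna() & night, "energy_kwh"] = 0.0

# Random-forest regression imputation for the remaining daytime gaps
features = ["hour", "day_of_year", "month", "year"]
known = df["energy_kwh"].notna()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(df.loc[known, features], df.loc[known, "energy_kwh"])
df.loc[~known, "energy_kwh"] = rf.predict(df.loc[~known, features])

assert df["energy_kwh"].notna().all()
```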

3.3. Feature Engineering

Feature engineering, the transformation of raw data into meaningful input variables, is a critical process, especially for predictive modelling in time-series analysis. In this study, the dataset comprises only minimal features: hourly timestamps and production values. Given this limitation, effective feature engineering was essential for deriving relevant variables that influence solar output.

3.3.1. Temporal and Seasonal Features

The hour_of_day, day_of_year, month, and year had already been created for the hourly timestamp data needed for the imputation. The hour_of_day was created to extract the hourly timeframe in a day, which influences the energy output, depending on the sunlight availability. The day_of_week was used to capture the weekly patterns in the energy output, while Is_Weekend was used to identify the production between the weekdays and weekends. The day_of_month was added to extract the monthly periodic patterns, and the day_of_year was extracted to account for variations in daylight duration throughout the year, which can influence solar output.
Additionally, the season feature was created to capture the seasonal variations in energy output based on Calgary’s climate. The seasonal categories are defined as follows: spring (1 March to 31 May), summer (1 June to 31 August), autumn (1 September to 30 November), and winter (1 December to 28/29 February).
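These boundaries reduce to a simple month-based mapping (a sketch; the `season` helper name is ours, and month boundaries handle the 28/29 February year-end automatically):

```python
import pandas as pd

# Season categories matching Calgary's climate definitions given above
def season(month: int) -> str:
    if 3 <= month <= 5:
        return "spring"
    if 6 <= month <= 8:
        return "summer"
    if 9 <= month <= 11:
        return "autumn"
    return "winter"  # December, January, February

dates = pd.to_datetime(["2024-04-15", "2024-07-01", "2024-10-31", "2024-12-25"])
seasons = [season(d.month) for d in dates]
# → ['spring', 'summer', 'autumn', 'winter']
```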

3.3.2. Cyclical Features

Cyclical encodings of the hour of the day and the day of the year (each using sine and cosine) are used to capture the natural cyclical patterns observed in solar generation data. Such features were also utilized in [19].
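A minimal sketch of this encoding (feature names follow those listed later in Section 3.6):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=48, freq="h")
df = pd.DataFrame(index=idx)

# Sine/cosine encoding: hour 23 and hour 0 become adjacent points on a circle
# rather than numerically distant values
df["hour_of_day_sin"] = np.sin(2 * np.pi * idx.hour / 24)
df["hour_of_day_cos"] = np.cos(2 * np.pi * idx.hour / 24)
df["day_of_year_sin"] = np.sin(2 * np.pi * idx.dayofyear / 365.25)
df["day_of_year_cos"] = np.cos(2 * np.pi * idx.dayofyear / 365.25)

# Each (sin, cos) pair lies on the unit circle
assert np.allclose(df["hour_of_day_sin"] ** 2 + df["hour_of_day_cos"] ** 2, 1.0)
```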

3.3.3. Rolling Average

The rolling average of a 4 h window was created as a feature to capture the short-term fluctuations and average production over the last four hours. Rolling average windows of 2, 6, 8, and 12 h were also evaluated at a single location. However, as these did not yield significant improvements in model performance, they were not implemented; instead, only the 4 h rolling average window was used.
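In pandas, the 4 h window reduces to a one-liner (a sketch on toy values; `min_periods=1` is our assumption for handling the first rows):

```python
import pandas as pd

s = pd.Series([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])
# 4-hour rolling mean; min_periods=1 keeps partial windows at the start
# instead of producing NaN
roll_mean_4 = s.rolling(window=4, min_periods=1).mean()
# → [0.0, 0.5, 1.6667, 3.5, 7.5, 13.5] (approximately)
```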

3.3.4. Lag Effects

Lag features were also created to reflect the temporal dependencies, such as the influence of the output in the previous hour on the current production. To identify the lag effect, the KPSS test [20], along with the ADF test [21], was used to ensure the reliability of the stationarity result. The KPSS and ADF tests assume opposite null hypotheses: the ADF test assumes that the time series is non-stationary under the null, so a p-value lower than the significance level of 0.05 indicates stationarity, while the KPSS test assumes stationarity under the null. Hence, a p-value lower than the significance level indicates non-stationarity.
The time series are usually differenced once or twice to obtain the lag effects in the data. The lag effects are monitored from the partial autocorrelation plot (PACF) by monitoring spikes that occur in the plot.
After first-order differencing, both ADF and KPSS confirmed the stationarity of the data. As shown in Figure 1, there was a significant spike in lag 1. Hence, ‘lag_1’ was created and added as a feature to the model. For simplicity, the lag-one (lag 1) values for temperature and humidity were also used as inputs to the model.

3.4. LightGBM Predictive Model Implementation

A time-series split was used to train and validate the performance of the LightGBM model. This type of split is appropriate because the data is ordered, which ensures that we do not train models using future data. Ten splits were used, with each validation set consisting of 24 data points. Therefore, a total of 240 data points were used to validate the performance of the models.
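This splitting scheme corresponds to scikit-learn's `TimeSeriesSplit` with a fixed validation size (a sketch consistent with the numbers above: ten splits of 24 points each, 240 in total):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(1000).reshape(-1, 1)  # placeholder ordered data

# Ten chronological splits, each validating on the next 24 hourly points
tscv = TimeSeriesSplit(n_splits=10, test_size=24)
n_val = 0
for train_idx, val_idx in tscv.split(X):
    # every validation index comes strictly after the training indices,
    # so no future data leaks into training
    assert train_idx.max() < val_idx.min()
    n_val += len(val_idx)

assert n_val == 240
```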
After checking several parameter settings, the final LightGBM models were trained using the below setting.
lgbm_model = lgb.LGBMRegressor(
    random_state=42,
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=63,
    max_depth=10,
    min_data_in_leaf=30,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=5,
)
When generating forecasts for the validation set, the humidity and temperature data from the previous day were used, because same-day values would not be available at the time the forecasts are made in a real-world scenario. Additionally, the forecasts produced by the model were used to update the lag and rolling-mean features for the next step (a technique known as recursive forecasting).
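The recursive loop can be sketched as follows (the `model_predict` function here is a hypothetical stand-in for the trained LightGBM model, and only the lag/rolling features are shown):

```python
import numpy as np

# Hypothetical stand-in for lgbm_model.predict(...) on a single feature row
def model_predict(lag_1, roll_mean_4):
    return 0.7 * lag_1 + 0.3 * roll_mean_4

history = [10.0, 12.0, 11.0, 13.0]  # last observed 4 hours
forecasts = []
for _ in range(24):  # one-day-ahead horizon
    lag_1 = history[-1]
    roll_mean_4 = float(np.mean(history[-4:]))
    y_hat = model_predict(lag_1, roll_mean_4)
    forecasts.append(y_hat)
    history.append(y_hat)  # the forecast becomes an input for the next step

assert len(forecasts) == 24
```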

3.5. Performance Metrics

The performance of the model was mainly evaluated using key performance metrics, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), to measure the accuracy.
MAE calculates the average of the absolute difference between actual and predicted values. It indicates how much the prediction differs from the actual values. On the other hand, RMSE is the square root of the average of squared differences between actual and predicted values. Below is how both metrics can be calculated.
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m} \left| X_i - Y_i \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} \left( X_i - Y_i \right)^2}$$
where m = total number of data points; Xi = predicted value; and Yi = actual value. These metrics offer more robust insights into the prediction accuracy, contributing to a more reliable evaluation of the model’s performance.
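Both metrics reduce to a few lines of numpy (a small worked example with made-up values):

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# MAE  = (0.5 + 0 + 2 + 1) / 4 = 0.875
# RMSE = sqrt((0.25 + 0 + 4 + 1) / 4) ≈ 1.146
assert abs(mae(actual, predicted) - 0.875) < 1e-9
assert rmse(actual, predicted) > mae(actual, predicted)  # RMSE penalizes large errors more
```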

3.6. XAI Integration

After implementation of the LightGBM model, both model-specific and model-agnostic Explainable AI techniques were implemented to interpret the LightGBM model’s decision-making process. The split count measures the frequency of the feature being used to split the data. A higher split count indicates greater importance of that feature. Gain represents the reduction in error caused by a split using a specific feature—the more a feature contributes to improving the model’s predictive performance, the higher its gain. SHAP ranks the features by averaging their marginal contribution effects, while PFI measures the drop in model performance when features are shuffled. PDP visualizes the change in predicted outcomes by altering the value of a specific feature while keeping others constant, and ALE works within small intervals, accumulating the average local changes in prediction.
Altogether, features extracted through feature engineering—lag_1, roll_mean_4, hour_of_day, day_of_year, day_of_week, day_of_month, week, year, day_of_year_sin, day_of_year_cos, hour_of_day_sin, hour_of_day_cos, DaySin, DayCos, YearSin, YearCos, humidity_lag1, and temperature_lag1—were used as inputs to XAI models to identify the most impactful predictors, as determined by the best-performing LightGBM model. Since different XAI methods can yield varying feature importance scores due to their distinct calculation mechanisms, a ranking-based approach was applied to obtain a more consistent and interpretable measure of feature impact.
After obtaining the feature importance rankings of various XAI methods, statistical tests were performed to assess the consensus and divergence among the methods to determine whether there were statistically significant differences between the methods. The non-parametric Friedman test [22] evaluates significance by the resulting p-value. If the p-value is less than 0.05, the null hypothesis (there is no significant difference between the XAI methods) will be rejected. It is assumed that at least one XAI method ranks features significantly differently from the others. If a difference was detected, post hoc analysis using the Nemenyi test [23] was conducted as a pairwise comparison to determine which XAI method ranks features differently.
If the Friedman test failed to reject the null hypothesis, Kendall’s W test was consecutively implemented to measure the agreement level between the applied XAI methods. Spearman’s rank correlation heatmap was also used to visualize the pairwise rank correlations between the methods.
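The statistical pipeline can be sketched with scipy on a small hypothetical rank matrix (the values are illustrative, not the study's results; Kendall's W is computed here without tie correction):

```python
import numpy as np
from scipy.stats import friedmanchisquare, spearmanr

# Hypothetical rank matrix: rows = features, columns = XAI methods
ranks = np.array([
    [1, 1, 1, 1],
    [2, 3, 2, 2],
    [3, 2, 3, 4],
    [4, 4, 4, 3],
    [5, 5, 5, 5],
])
n_features, n_methods = ranks.shape

# Friedman test: do the methods (treatments) rank features significantly differently?
stat, p = friedmanchisquare(*[ranks[:, j] for j in range(n_methods)])

# Kendall's W: agreement of the methods (judges) on the feature ranking
R = ranks.sum(axis=1)                    # total rank received by each feature
S = float(((R - R.mean()) ** 2).sum())
W = 12 * S / (n_methods**2 * (n_features**3 - n_features))

# Pairwise Spearman rank correlation between two methods
rho, _ = spearmanr(ranks[:, 0], ranks[:, 1])

assert p > 0.05   # fail to reject the null: no significant difference
assert W > 0.9    # strong agreement, as in the range reported in the Abstract
assert rho > 0.5
```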

4. Results

4.1. Trend Analysis

The highest energy production recorded at the Southland Leisure Center was 113,679.06 (in the raw, pre-correction units), while the lowest was 0.0 kWh. A total of 24,633 missing hourly records were found, and the consecutive missing gaps were analyzed. The longest gap spanned 838 h, approximately 35 days, and occurred between 29 March 2021 and 3 May 2021.
Since a huge increase in solar output was discovered starting on 30 September 2023, as shown in Figure 2, the values were divided by 1000 to change them to kWh. As shown in Figure 3, an anomaly was found, which went up to around 250 kWh after the conversion. Since all the other data lay under 150 kWh throughout the years after changing the unit to kWh, the threshold was set to 150, and the anomalies were imputed by random forest imputation. After data pre-processing and imputation, the highest production was 123.6 kWh while the lowest was 0.0 kWh, and the time series showed a clear seasonal pattern, as shown in Figure 4.
In Figure 5a, the line chart of hour_of_day displays a bell-shaped curve, with peak production at midday, with 1 pm being the highest, followed by noon and 2 pm. Production between the hours of 9 pm and 5 am was close to 0.0 kWh, which can be considered no energy output. The average production by day_of_month indicated random distributions with multiple peaks and dips throughout the month. In terms of day_of_week, Wednesdays had the lowest average production, while Saturdays recorded the highest, followed closely by Fridays. However, the average production was quite similar between weekdays and weekends, which suggested that the distinction between them may not be a significant feature. The feature month showed a clear average production pattern throughout the year, which revealed higher energy output during the summer and lower output during winter, with the highest average production being in the month of July and the lowest average production in December.
The chart of the yearly average production displayed a downward trend, with a slight increase in 2021, followed by another decline in 2022 and a further decrease in 2023 and 2024, as shown in Figure 5f. The seasonal average production exhibited a clear pattern—it reached its highest value, exceeding 80 kWh, during the summer, followed by spring and autumn, and its lowest, below 20 kWh, during the winter.
For the other three locations, almost the same patterns as those in the Southland location were found. However, an exception was the Whitehorn Multi-Service Centre, where the highest average yearly production was observed in 2025, even though the data only runs until the end of June 2025 (as shown in Figure 6).

4.2. Model Performance

Overall, as shown in Table 1, the forecasting performance varies by location. Glenmore Water Treatment Plant consistently achieves the lowest errors, with an average RMSE of 5.7652 and MAE of 3.5874, indicating that production at this site is highly predictable and well captured by the model.
The Southlands Centre also demonstrated strong performance, with an average RMSE of 9.6567 and MAE of 5.9716. While errors were higher than those observed at Glenmore, the variability across time splits remained moderate, indicating reliable generalization over successive forecasting horizons. This suggests that the time-dependent patterns at Southlands are reasonably stable but subject to moderate fluctuations, which the model is largely able to capture.
In contrast, the North Corporate Warehouse exhibited greater variability across time splits, with an average RMSE of 13.5308 and MAE of 8.2982. Although the forecasting accuracy remains acceptable, these results suggest that the site would benefit from additional explanatory variables or a more specialized modelling approach to account for its higher volatility.
The Whitehorn Centre proved to be the most challenging location to forecast. It recorded the highest average RMSE (50.8785) and MAE (31.9097), with substantial fluctuations across time splits. As shown in Figure 6, 2025 has the highest average yearly production, which may partly explain why this site recorded the highest average RMSE.
Across all locations, RMSE values are consistently higher than MAE values, which is expected because RMSE penalizes large errors more heavily. The difference between RMSE and MAE is particularly pronounced at Whitehorn Centre, implying the presence of occasional large forecast errors or outliers. Conversely, the smaller RMSE–MAE gap at Glenmore Water Treatment Plant suggests fewer extreme deviations.
In summary, the results indicate that location-specific characteristics dominate forecasting performance more than the choice of model. Glenmore Water Treatment Plant is the most predictable and operationally reliable site, followed by Southlands Centre. North Corporate Warehouse presents moderate forecasting difficulty, while Whitehorn Centre shows high uncertainty and instability, indicating a need for additional features or alternative modelling strategies to better capture its underlying dynamics.

4.3. XAI Results

Before discussing the XAI results, it is crucial to note that while explainability techniques such as SHAP, PDP, and ALE are widely used in machine learning applications, their limitations in time-series contexts must be acknowledged. These methods typically assume feature independence, an assumption that does not strictly hold for temporally correlated variables such as lagged power features. Nevertheless, such techniques remain valuable in applied forecasting studies as diagnostic and exploratory tools rather than as sources of strict causal inference. In practice, they are frequently used to identify dominant drivers, detect implausible model behaviour, support feature selection, and improve model transparency. In many real-world applications, including solar power forecasting, explainability methods are employed in conjunction with domain knowledge and physical understanding to provide meaningful insights, even if the explanations are approximate. Until time-series-specific explainability frameworks become more mature and widely validated, these established tools offer a practical compromise between interpretability and model complexity, enabling better understanding and trust in data-driven forecasting models.
Turning to the results, as shown in Figure 7, there is strong consensus across all six XAI methods at the Southlands Centre location regarding the most informative feature: lag_1 is consistently ranked first. This suggests that the immediate past value of the target variable is the primary driver of the LightGBM model’s predictions, a finding consistent with the time-series forecasting literature, where auto-regressive terms often dominate. There is similarly high agreement on the least important features, with day_of_week consistently ranked last (Rank 18). However, significant divergence is observed in the middle rankings, particularly for the Split method: hour_of_day_cos, for instance, is ranked highly by Gain (Rank 2) and ALE (Rank 3) but is undervalued by Split importance (Rank 14).
The ranking results for Whitehorn Centre exhibit a similar pattern of stability, as shown in Figure 8. Here, lag_1 is universally identified as the top feature (Rank 1), while day_of_week is universally the least important (Rank 18). This unifies the interpretability landscape for this location, providing high confidence that the model relies heavily on immediate history. A notable divergence again involves the Split method. The hour_of_day_cos is ranked as a top 3 feature by five out of six methods but falls to Rank 13 in the Split method. This consistent disagreement highlights the risk of relying solely on split-frequency metrics.
For North Corporate Warehouse, as shown in Figure 9, the agreement on the dominant feature, lag_1 (Rank 1), remains unbroken, and hour_of_day_cos is consistently identified as a critical feature (Ranks 2–4) by Gain, SHAP, PDP, PFI, and ALE. The Split method again diverges, placing this feature at Rank 11. We also observe divergence regarding year: it is considered moderately important by Gain and PDP (Rank 5) and by SHAP and PFI (Rank 4) but is deemed less relevant by Split (Rank 8). Meanwhile, day_of_month is consistently negligible (Rank 17).
For the Glenmore location, the results shown in Figure 10 mirror the systemic patterns seen with other locations, with lag_1 securing the top rank across all methods. Divergence is again most visible in the treatment of cyclical time features; hour_of_day_cos is a top-tier feature for Gain and PFI (Rank 2) but is downgraded by Split (Rank 11). This repeated pattern across locations confirms that the divergence is methodological rather than data-specific.
Based on the above discussion, across all four locations the rankings reveal a remarkably systematic pattern: lag_1 is the top feature, and day_of_week is consistently the least informative. Furthermore, a clear divergence exists between Split and the other five methods: Split consistently undervalues the engineered cyclical feature hour_of_day_cos relative to the others.
When multiple XAI methods agree (for example, lag_1 being a top predictor), we can be confident that these features are fundamental drivers of model predictions and worthy of operational attention. When methods diverge (e.g., hour_of_day_sin), rankings should be treated as hypothesis signals rather than definitive facts.
Finally, the presence of correlated features—such as lag_1 vs. roll_mean_4—warrants caution. High collinearity can dilute the importance of individual features in methods like PFI and PDP, as these methods may force the model to evaluate unrealistic data combinations (e.g., a high lag_1 but low roll_mean_4) [24]. ALE is designed to handle such correlations better by averaging local effects [7].

4.4. Statistical Tests Results

Statistical analysis using the Friedman test across all locations failed to reject the null hypothesis of no significant differences in feature rankings between the XAI methods (p > 0.05). However, failing to reject this null hypothesis does not by itself establish that the XAI methods agree with one another; it only indicates the absence of detectable systematic differences.
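As a sketch (not the study's code), the Friedman procedure described above can be reproduced with `scipy.stats.friedmanchisquare`, where each XAI method supplies one ranking of the same feature set; the rank values below are illustrative, not the paper's actual rankings.

```python
from scipy.stats import friedmanchisquare

# Illustrative feature rankings (1 = most important) from three XAI
# methods over six features; values are hypothetical.
shap_ranks = [1, 2, 3, 4, 5, 6]
gain_ranks = [1, 3, 2, 4, 5, 6]
split_ranks = [1, 4, 3, 2, 6, 5]

# Friedman test: features act as blocks, methods as treatments.
stat, p_value = friedmanchisquare(shap_ranks, gain_ranks, split_ranks)
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.3f}")

# A large p-value means we cannot reject the null hypothesis of
# identical rank distributions -- which, as noted above, is not the
# same as positive evidence of agreement.
```

Because mostly agreeing rankings produce a small test statistic, the p-value here is large, mirroring the non-significant result reported for all four locations.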
Following the Friedman test, Spearman correlation heatmaps were generated to study the correlations between XAI methods. As shown in Figure 11, Figure 12, Figure 13 and Figure 14, there are exceptionally strong correlations (r > 0.90) between SHAP, Gain, PDP, PFI, and ALE. This high degree of consistency across all locations suggests that despite their different ways of identifying the important features, these methods identify a similar hierarchy of feature importance for the LightGBM model across the four locations.
In contrast, the Split method consistently exhibits the lowest correlation with all other methods, with Spearman coefficients dropping as low as 0.67 in the North Corporate Warehouse location. This divergence is expected; split-frequency importance is biased toward continuous variables or features that allow for many deep splits [25]. While Gain measures the quality of splits, Split only measures quantity. Therefore, the observed divergence underscores the risk of relying on split counts for feature selection.
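A minimal sketch of this pairwise comparison uses `scipy.stats.spearmanr` on two ranking vectors; the ranks below are illustrative, not the study's actual values.

```python
from scipy.stats import spearmanr

# Illustrative rankings over six features (1 = most important).
gain_ranks = [1, 2, 3, 4, 5, 6]
split_ranks = [1, 4, 3, 2, 6, 5]  # diverges in the middle ranks

# Spearman's rho on the two rank vectors; a value well below 1.0
# flags a method whose ranking diverges from the others.
rho, p_value = spearmanr(gain_ranks, split_ranks)
print(f"Spearman rho = {rho:.3f}")
```

With agreement at the extremes but disagreement in the middle ranks, rho drops noticeably below 1, which is the pattern the Split method exhibits in Figures 11–14.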
Subsequently, Kendall’s W test was conducted to quantify the overall level of agreement among the XAI methods. The results indicate an overall strong agreement across all locations, with Kendall’s W values ranging from 0.90 to 0.94.
Given the comparatively moderate correlation observed between the Split-based ranking and the other XAI methods, Kendall’s W was recalculated after excluding the Split method. In this case, the level of agreement further increased, with W values ranging from 0.96 to 0.98, indicating near-perfect concordance among the remaining methods.
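SciPy does not ship a Kendall's W function, but it follows directly from the standard definition W = 12S / (m^2 (n^3 - n)), where m is the number of raters (XAI methods), n the number of items (features), and S the sum of squared deviations of the per-feature rank sums from their mean. A minimal sketch with illustrative rankings (not the study's values):

```python
import numpy as np

def kendalls_w(rank_matrix):
    """Kendall's coefficient of concordance.

    rank_matrix: shape (m, n) -- m raters, each row a tie-free
    ranking of the same n items. Returns W in [0, 1].
    """
    m, n = rank_matrix.shape
    rank_sums = rank_matrix.sum(axis=0)              # R_i per item
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # deviation term
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Illustrative rankings from three methods over four features.
ranks = np.array([[1, 2, 3, 4],
                  [1, 2, 3, 4],
                  [2, 1, 3, 4]])
print(f"W = {kendalls_w(ranks):.3f}")  # values near 1 indicate strong concordance
```

Recomputing W after dropping a divergent row (as done above for Split) raises the coefficient, since the remaining rankings are more concordant.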

4.5. Cross-Model Validation Results

To determine whether the LightGBM results were specific to the model’s architecture or robust across different tree-based implementations, CatBoost [26] was also evaluated. Unlike LightGBM, CatBoost does not provide split-count importance; its default feature importance method is more similar to Gain.
Consistent with the LightGBM results, there was no significant difference between the XAI methods, and Kendall’s W test showed strong agreement (ranging from 0.91 to 0.94). The Spearman correlation results were similar to those achieved with LightGBM regarding SHAP, Gain, ALE, and PFI. While the correlation between PDP and Gain was 0.78 in one location, it remained between 0.81 and 0.97 in the majority of cases.
Another finding from our cross-model validation is the stability of the primary feature rankings, particularly regarding the lag_1 energy feature. Despite the fundamental differences in how LightGBM and CatBoost handle numerical and categorical data, the lag_1 feature remained the top predictor across all experimental locations and both model architectures. This consistency suggests that the dominance of historical generation data is a robust physical characteristic of solar time-series forecasting rather than an artifact of a specific algorithm’s internal logic.
Regarding the feature rank comparison between LightGBM and CatBoost (Figure 15, Figure 16, Figure 17, Figure 18 and Figure 19), the rankings are identical in nearly 40% of instances. This is illustrated by the diagonal lines, where both models assign the same rank to a given feature (a difference of zero, as shown in Figure 19). In the remaining cases, the rankings differ slightly (by 1–2 positions, around 40% of instances) or substantially (by 7–9 positions, nearly 2%).
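The comparison in Figure 19 amounts to tabulating the absolute rank differences per feature between the two models. A minimal sketch with hypothetical rankings (not the study's values):

```python
import numpy as np

# Hypothetical feature ranks from the two models (1 = most important).
lgbm_ranks = np.array([1, 2, 3, 4, 5, 6, 7, 8])
catb_ranks = np.array([1, 3, 2, 4, 5, 7, 6, 8])

# Absolute rank difference per feature, then the share of features
# in each agreement band.
diffs = np.abs(lgbm_ranks - catb_ranks)
identical = np.mean(diffs == 0)                    # same rank in both models
small_shift = np.mean((diffs >= 1) & (diffs <= 2)) # shifted by 1-2 positions

print(f"identical: {identical:.0%}, shifted 1-2 positions: {small_shift:.0%}")
```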

5. Discussion

As shown in Section 4, for both the LightGBM and CatBoost models, the Friedman analysis showed no significant differences between XAI methods, and Kendall’s W revealed strong to very strong agreement across all locations. For LightGBM, Spearman’s rank correlation showed strong correlations between the majority of XAI methods but only moderate correlations between Split and the other methods. For CatBoost, correlations were likewise strong between SHAP, Gain, ALE, and PFI, with a moderate correlation between PDP and Gain at one location.
Based on the above discussion, we suggest a multi-layered practical guide for utilizing XAI in solar energy forecasting (and other similar domains). First, we should never rely on a single XAI method—particularly split-frequency metrics—when selecting features or making operational decisions. Instead, we recommend a triangulation approach: if a feature like lag_1 is identified as critical by multiple XAI methods, it can be treated as a ground-truth operational driver with high confidence. However, when methods diverge—such as when a feature is ranked high by one method but low by another—we should treat that feature as a hypothesis signal rather than a fact. In such cases, the divergence often points to complex feature interactions or high collinearity that requires further empirical validation through model retraining. Finally, for robust feature selection, we suggest using median rank or the Borda count [27] across a suite of at least three XAI techniques (e.g., SHAP, ALE, and Gain) to neutralize individual method biases and ensure stable insights.
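The median-rank aggregation suggested above can be sketched as follows; the feature names and rank values are illustrative, and the Borda count [27] would simply sum positions instead of taking the median.

```python
import numpy as np

# Hypothetical rankings (1 = most important) from three XAI methods.
rankings = {
    "shap": {"lag_1": 1, "hour_of_day_cos": 2, "year": 3, "day_of_week": 4},
    "ale":  {"lag_1": 1, "hour_of_day_cos": 3, "year": 2, "day_of_week": 4},
    "gain": {"lag_1": 1, "hour_of_day_cos": 2, "year": 4, "day_of_week": 3},
}

features = list(next(iter(rankings.values())))
# Median rank across methods dampens any single method's bias.
median_rank = {f: float(np.median([r[f] for r in rankings.values()]))
               for f in features}
consensus = sorted(features, key=median_rank.get)
print(consensus)  # lag_1 ranks first under all three methods
```

Features whose median rank is stable across methods can be treated with higher confidence, while a feature whose per-method ranks straddle the median widely is the "hypothesis signal" case discussed above.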

6. Conclusions

This study provides a rigorous empirical foundation for validating XAI consistency, demonstrating that while diverse XAI frameworks largely reach a consensus on feature importance, significant pockets of divergence exist that can lead to misinformed operational strategies.
Our findings yield three critical takeaways:
  • Relying on a single XAI technique can be risky. Future workflows should adopt a triangulated approach, using at least three distinct XAI categories (e.g., SHAP, ALE, and Gain) to ensure explanation stability.
  • We suggest that Kendall’s W and Spearman’s rank correlations should be adopted as diagnostic metrics when comparing feature-ranking XAI methods. A low consensus score between XAI methods serves as a red flag, indicating a lack of agreement in the relative importance of features and suggesting that the feature-ranking explanations may be unstable or method-specific rather than reflecting a robust, model-agnostic pattern.
  • Through cross-validation with LightGBM and CatBoost, this research confirms that certain feature hierarchies (such as the dominance of the lag-1 variable) are architecture-independent. This suggests that XAI can uncover universal domain truths, provided that the practitioner accounts for method-specific biases.
In summary, we demonstrate that explanation consistency across XAI methods cannot be assumed and must instead be empirically verified. Therefore, embedding rank-consensus diagnostics into standard XAI workflows provides a practical mechanism for operationalizing this shift and for distinguishing genuinely informative explanations from those that merely appear plausible.

Author Contributions

Conceptualization, K.T.T. and W.S.; methodology, K.T.T. and W.S.; software, K.T.T. and W.S.; validation, K.T.T. and W.S.; formal analysis, K.T.T.; investigation, K.T.T.; data curation, K.T.T.; writing—original draft preparation, K.T.T.; writing—review and editing, K.T.T. and W.S.; visualization, K.T.T. and W.S.; supervision, W.S.; project administration, K.T.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used can be downloaded from https://data.calgary.ca/Environment/Solar-Energy-Production/ytdn-2qsp/about_data, accessed on 10 July 2025.

Acknowledgments

During the preparation of this study, the first author used ChatGPT, GPT-5, for the purposes of translation, clarity, and conciseness in certain sections, as well as for debugging some parts of the codes. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hokmabad, H.N.; Husev, O.; Belikov, J. Day-ahead Solar Power Forecasting Using LightGBM and Self-Attention Based Encoder-Decoder Networks. IEEE Trans. Sustain. Energy 2024, 16, 866–879.
  2. Biswal, B.; Deb, S.; Datta, S.; Ustun, T.S.; Cali, U. Review on smart grid load forecasting for smart energy management using machine learning and deep learning techniques. Energy Rep. 2024, 12, 3654–3670.
  3. Petrosian, O.; Zhang, Y. Solar Power Generation Forecasting in Smart Cities and Explanation Based on Explainable AI. Smart Cities 2024, 7, 3388–3411.
  4. Saeed, W.; Omlin, C. Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities. Knowl. Based Syst. 2023, 263, 110273.
  5. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30.
  6. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30.
  7. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2nd ed.; Leanpub: Victoria, BC, Canada, 2022.
  8. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  9. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  10. Apley, D.W.; Zhu, J. Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc. Ser. B Stat. Methodol. 2020, 82, 1059–1086.
  11. Munir, S.; Pradhan, M.R.; Abbas, S.; Khan, M.A. Energy consumption prediction based on LightGBM empowered with explainable artificial intelligence. IEEE Access 2024, 12, 91263–91271.
  12. Agrawal, A.; Sazos, M.; Al Durra, A.; Maniatakos, M. Towards robust power grid attack protection using LightGBM with concept drift detection and retraining. In Proceedings of the 2020 Joint Workshop on CPS&IoT Security and Privacy; Association for Computing Machinery: New York, NY, USA, 2020; pp. 31–36.
  13. Zhao, R.; Zhang, L.; Li, Z. Identification of Financial Fraud in Listed Companies Based on Bayesian-LightGBM Model. In Proceedings of the 2024 2nd International Conference on Artificial Intelligence, Systems and Network Security (AISNS 2024); Association for Computing Machinery: New York, NY, USA, 2025; pp. 339–344.
  14. Chaibi, M.; Benghoulam, E.M.; Tarik, L.; Berrada, M.; Hmaidi, A.E. An interpretable machine learning model for daily global solar radiation prediction. Energies 2021, 14, 7367.
  15. Kuzlu, M.; Cali, U.; Sharma, V.; Güler, Ö. Gaining insight into solar photovoltaic power generation forecasting utilizing explainable artificial intelligence tools. IEEE Access 2020, 8, 187814–187823.
  16. Nallakaruppan, M.K.; Shankar, N.; Bhuvanagiri, P.B.; Padmanaban, S.; Khan, S.B. Advancing solar energy integration: Unveiling XAI insights for enhanced power system management and sustainable future. Ain Shams Eng. J. 2024, 15, 102740.
  17. Park, J.; Moon, J.; Jung, S.; Hwang, E. Multistep-ahead solar radiation forecasting scheme based on the light gradient boosting machine: A case study of Jeju Island. Remote Sens. 2020, 12, 2271.
  18. Tang, F.; Ishwaran, H. Random forest missing data algorithms. Stat. Anal. Data Min. ASA Data Sci. J. 2017, 10, 363–377.
  19. Chakraborty, D.; Elzarka, H. Advanced machine learning techniques for building performance simulation: A comparative analysis. J. Build. Perform. Simul. 2019, 12, 193–207.
  20. Kwiatkowski, D.; Phillips, P.C.; Schmidt, P.; Shin, Y. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? J. Econom. 1992, 54, 159–178.
  21. Dickey, D.A.; Fuller, W.A. Distribution of the estimators for autoregressive time series with a unit root. J. Am. Stat. Assoc. 1979, 74, 427–431.
  22. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701.
  23. Nemenyi, P.B. Distribution-Free Multiple Comparisons; Princeton University: Princeton, NJ, USA, 1963.
  24. Hooker, G.; Mentch, L.; Zhou, S. Unrestricted permutation forces extrapolation: Variable importance requires at least one more model, or there is no free variable importance. Stat. Comput. 2021, 31, 82.
  25. Strobl, C.; Boulesteix, A.L.; Zeileis, A.; Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinform. 2007, 8, 25.
  26. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31.
  27. Varghese, A.S.; Somasundaram, G.; Nambiar, A. On Developing Explainable AI Evaluation Metrics for Image Classification Using Borda Count and Multiple Correlation Techniques. ACM Trans. Intell. Syst. Technol. 2025.
Figure 1. First-order differenced series of KPSS.
Figure 2. Sudden and extreme spikes in late 2023 in Southland.
Figure 3. Anomalies found after dividing by 1000 (unit conversion) in Southland.
Figure 4. Solar production after imputation with random forest in Southland.
Figure 5. Average energy production by (a) hour of day; (b) day of month; (c) day of week; (d) weekdays vs. weekends; (e) month; (f) year; (g) season.
Figure 6. Average production by month and year in Whitehorn Multi-Service Centre.
Figure 7. Ranking comparison of all XAI methods for Southlands Centre.
Figure 8. Ranking comparison of all XAI methods for Whitehorn Centre.
Figure 9. Ranking comparison of all XAI methods for North Corporate Warehouse.
Figure 10. Ranking comparison of all XAI methods for Glenmore Water Treatment Plant.
Figure 11. Spearman’s rank correlation heatmap (Southland).
Figure 12. Spearman’s rank correlation heatmap (North Corporate Warehouse).
Figure 13. Spearman’s rank correlation heatmap (Whitehorn Service Center).
Figure 14. Spearman’s rank correlation heatmap (Glenmore).
Figure 15. Feature rank comparison between LightGBM and CatBoost (Southlands Centre). Different colors are used to represent distinct features.
Figure 16. Feature rank comparison between LightGBM and CatBoost (Whitehorn Centre). Different colors are used to represent distinct features.
Figure 17. Feature rank comparison between LightGBM and CatBoost (North Corporate Warehouse). Different colors are used to represent distinct features.
Figure 18. Feature rank comparison between LightGBM and CatBoost (Glenmore). Different colors are used to represent distinct features.
Figure 19. Differences in feature ranking between LightGBM and CatBoost.
Table 1. Forecasting performance across locations of the best LightGBM model.
| Split | Metric | Southlands Centre | Whitehorn Centre | North Corporate Warehouse | Glenmore Water Treatment Plant |
|---|---|---|---|---|---|
| 1 | RMSE | 15.5179 | 39.7431 | 15.6982 | 9.6990 |
| 1 | MAE | 9.7015 | 23.8118 | 10.2982 | 6.1004 |
| 2 | RMSE | 5.9587 | 64.8073 | 13.1643 | 3.4459 |
| 2 | MAE | 3.6308 | 45.3081 | 8.1971 | 2.0299 |
| 3 | RMSE | 9.6111 | 50.7185 | 14.2117 | 4.6768 |
| 3 | MAE | 5.5653 | 32.6921 | 7.1421 | 2.7397 |
| 4 | RMSE | 11.9866 | 65.4772 | 9.1897 | 5.5152 |
| 4 | MAE | 7.3637 | 42.7564 | 5.8265 | 3.3064 |
| 5 | RMSE | 6.7854 | 46.6546 | 11.0693 | 5.1042 |
| 5 | MAE | 4.1379 | 27.8820 | 6.5250 | 3.4573 |
| 6 | RMSE | 6.0813 | 53.9753 | 7.1280 | 6.3114 |
| 6 | MAE | 4.0014 | 32.9228 | 4.1773 | 4.1725 |
| 7 | RMSE | 10.2782 | 39.5914 | 9.7129 | 5.0726 |
| 7 | MAE | 6.3272 | 23.4742 | 6.8736 | 2.8919 |
| 8 | RMSE | 5.7738 | 42.5065 | 18.3104 | 5.0714 |
| 8 | MAE | 3.4477 | 23.7408 | 8.6614 | 3.3199 |
| 9 | RMSE | 13.9398 | 55.8495 | 19.9058 | 7.2009 |
| 9 | MAE | 8.8078 | 34.4194 | 13.6641 | 4.4056 |
| 10 | RMSE | 10.6346 | 49.4617 | 16.9181 | 5.5551 |
| 10 | MAE | 6.7328 | 32.0890 | 11.6170 | 3.4500 |
| Average | RMSE | 9.6567 | 50.8785 | 13.5308 | 5.7652 |
| Average | MAE | 5.9716 | 31.9097 | 8.2982 | 3.5874 |

Share and Cite

MDPI and ACS Style

Thinn, K.T.; Saeed, W. Consensus and Divergence in Explainable AI (XAI): Evaluating Global Feature-Ranking Consistency with Empirical Evidence from Solar Energy Forecasting. Mathematics 2026, 14, 297. https://doi.org/10.3390/math14020297

