1. Introduction
Pakistan has grappled with a significant energy crisis since 2004, driven by several factors, including the absence of advanced modeling tools in power planning and policy-making, over-reliance on imported energy resources, and inadequate governance [
1]. This crisis has particularly impacted the province of Punjab, which is home to over 90 million people and consumes 68% of the country’s electricity. With an annual demand growth of 6–8%, Punjab faces a persistent demand–supply gap of 4000 MW, leading to severe economic repercussions and disruptions in citizens’ daily lives. Addressing this energy deficit is critical for the region’s sustainable development and economic stability.
In response to this crisis, the government has prioritized mitigating power deficits by investing in the energy sector and promoting renewable energy initiatives to fully leverage the country’s sustainable electricity potential [
2]. Renewable energy sources, such as solar, wind, and hydro, offer sustainable and abundant alternatives to finite fossil fuels. These alternatives not only enhance energy security and reduce geopolitical tensions associated with energy dependence but also play a vital role in addressing climate change. Furthermore, renewable energy improves air quality and drives economic growth by creating employment opportunities in manufacturing, installation, operation, and maintenance. These benefits underscore the importance of transitioning to renewable energy systems to ensure long-term energy sustainability.
Technological advancements and economies of scale have significantly reduced the cost of renewable energy production, making it increasingly competitive with conventional energy sources, and renewable energy prices tend to be more stable over time than the volatile prices of fossil fuels. The global transition to renewable energy nevertheless reveals significant disparities in adoption rates across regions, with the EU, China, and India leading in wind and solar advancements [3]. Despite these advantages, Pakistan still relies heavily on non-renewable resources such as coal, oil, and natural gas, which dominate the country’s electricity generation. This reliance highlights the urgent need to accelerate the adoption of renewable energy technologies to diversify the energy mix and reduce dependency on fossil fuels.
Among the renewable energy options, recent advances in solar photovoltaic (PV) materials and systems have enhanced efficiency, reduced costs, and improved energy storage, making PV a viable renewable energy solution [
4]. The Cholistan Desert, with its vast high-irradiation areas, offers an ideal location for large-scale solar PV deployment, providing a viable pathway to address the energy crisis effectively and sustainably. However, despite this immense potential, the solar energy sector in Pakistan remains in its nascent stages, presenting both challenges and opportunities for growth [
5]. The increasing global demand for renewable energy has introduced complexities in assessing site suitability and evaluating the technical potential for solar installations. Optimal spatial location selection for utility-scale PV systems is critical to maximizing the benefits of solar resources while addressing the inherent variability of solar energy [
6].
While Geographic Information Systems (GIS) and machine learning (ML) have been widely adopted for renewable energy site selection, existing studies often rely on fragmented approaches that prioritize theoretical datasets or fail to integrate multi-dimensional, field-verified inputs. This limitation is particularly evident in hyper-arid regions like the Cholistan Desert, Pakistan, where extreme environmental conditions (e.g., temperature gradients, seasonal variability) and dynamic land-use patterns pose unique challenges. Additionally, the integration of GIS and ML for solar photovoltaic (PV) site selection remains underexplored, especially in understudied desert ecosystems. Current methodologies lack adaptability to real-time environmental fluctuations and often overlook the interpretability of ML models, limiting their practical relevance and scalability. This gap highlights the need for an integrated framework that combines spatially explicit ground-truth data with advanced ML algorithms to enhance prediction accuracy, transparency, and long-term project viability in complex environments.
The primary research question guiding this study is as follows: How can an integrated GIS–ML framework, leveraging spatially explicit ground-truth data and advanced machine learning algorithms (Random Forest, XGBoost, and Multilayer Perceptron), improve the precision, interpretability, and adaptability of solar PV site selection in hyper-arid regions like the Cholistan Desert, Pakistan? To address this, this study aims to develop and validate an integrated GIS–ML framework tailored to the unique challenges of hyper-arid environments. Specifically, it seeks to synthesize spatially explicit ground-truth data with advanced ML algorithms to enhance prediction accuracy and adaptability, incorporate SHAP (SHapley Additive exPlanations) for transparent feature contribution analysis, and unify adaptive geospatial modeling with temporal dynamics to account for extreme environmental conditions and ensure long-term project viability. By providing a scalable, context-sensitive solution, this framework advances precision in site selection and supports sustainable solar energy development in understudied desert ecosystems, bridging the gap between theoretical datasets and practical applications.
The remaining part of the paper is structured as follows.
Section 2 reviews literature on GIS–ML approaches in renewable energy planning, contextualizing global trends and regional challenges.
Section 3 outlines the methodology, including the Cholistan Desert study area and data sources for solar PV site evaluation.
Section 4 presents results, mapping high-potential zones and validating outcomes via geospatial/statistical models.
Section 5 discusses strategies for addressing policy, economic, and environmental factors. The conclusion underscores how data-driven analytics can bridge renewable energy potential and practical deployment, offering scalable solutions for energy security in resource-constrained regions like Punjab.
2. Literature Review
Identifying optimal locations for solar photovoltaic (PV) installations is a critical step in advancing renewable energy strategies, requiring a comprehensive evaluation of ecological, technical, economic, and social factors. The integration of Geographic Information Systems (GIS) and machine learning (ML) methods provides a structured and robust approach to addressing the complexities inherent in site suitability assessments. The proliferation of Geographic Information System (GIS) platforms [
7,
8] has revolutionized spatial analysis for renewable energy projects, enabling a systematic assessment of solar and wind farm feasibility and the identification of optimal site configurations. GIS facilitates the integration, processing, and visualization of multi-source geospatial datasets—such as solar irradiance, land cover, and topography—to support data-driven decision-making in photovoltaic (PV) infrastructure development. By coupling GIS with economic models, researchers can quantify grid-connected technical potential and analyze cost-benefit dynamics for solar energy generation, enhancing the precision of large-scale project planning.
A critical application of GIS lies in evaluating land suitability for utility-scale PV installations. Advanced methodologies incorporating exclusion criteria (e.g., environmental sensitivities, slope limitations) and spatial constraints (e.g., proximity to grids, protected areas) have proven instrumental in minimizing ecological and operational risks [
9]. Furthermore, GIS-based multi-criteria decision-making (MCDM) frameworks are increasingly adopted to assess site suitability and technical capacity for renewable energy projects. These frameworks synthesize environmental, economic, and social variables, offering stakeholders actionable insights for prioritizing high-potential locations and streamlining regulatory compliance [
10]. Such tools also enable the integration of dynamic parameters like solar radiation patterns and land-use changes, reinforcing GIS’s role as a cornerstone of sustainable energy planning [
11].
Machine learning (ML) has emerged as a transformative paradigm for addressing complex spatial classification and regression challenges in renewable energy siting [
12]. ML algorithms excel in processing heterogeneous datasets, with applications spanning soil property mapping, biodiversity conservation, land-use classification, and energy infrastructure optimization. For instance, the study in [13] demonstrated the efficacy of ensemble ML techniques in identifying optimal locations for waste-to-energy facilities, generating high-resolution suitability maps while isolating critical siting parameters such as population density and transportation networks. Similarly, the study in [14] employed ML models to map wind turbine suitability in Iowa, USA, validating their utility in balancing energy output with environmental and land-use constraints.
These advancements underscore how machine learning (ML) complements GIS-driven spatial analysis by enhancing predictive accuracy and adaptability in clean energy infrastructure planning. Building on GIS frameworks that evaluate land suitability and technical potential (e.g., [
9]), ML algorithms such as Support Vector Regression (SVR), decision trees, and Random Forests now enable a deeper analysis of multidimensional geospatial data. By systematically integrating resource availability (e.g., solar irradiance), microclimatic variability, regulatory constraints, and socio-economic indicators, ML refines predictive models for solar PV deployment, optimizing both site selection precision and long-term project viability [
15].
The integration of ML with GIS platforms addresses a critical gap identified in earlier studies: the need for dynamic, data-driven adaptability in renewable energy planning. For instance, while traditional GIS methodologies rely on static criteria (e.g., [
11]), ML–GIS hybrid models uncover latent patterns in large-scale datasets, such as shifting climatic trends or evolving land-use dynamics, to improve decision robustness [
16,
17,
18]. This synergy enables predictive analytics that adapt to real-time environmental fluctuations, ensuring that renewable energy projects remain resilient under changing conditions.
Existing photovoltaic (PV) site suitability assessments in arid regions frequently employ fragmented analytical approaches—relying solely on geospatial tools or machine learning (ML) and prioritizing theoretical datasets over multi-dimensional, field-verified inputs. Such methods inadequately address hyper-arid environmental complexities, including extreme temperature gradients, and seasonal resource variability, limiting their practical relevance in regions like the Cholistan Desert, Pakistan. To resolve these shortcomings, this study introduces an integrated framework that synthesizes spatially explicit ground-truth data with three advanced ML algorithms: Random Forest, XGBoost, and a Multilayer Perceptron (MLP) neural network. Leveraging ArcGIS 10.8 for spatial data processing and SHAP (SHapley Additive exPlanations) for interpretability, the methodology ensures both high prediction accuracy and transparent feature contribution analysis. By unifying adaptive geospatial modeling with temporal dynamics, the framework delivers a scalable, context-sensitive solution tailored to the unique solar energy challenges of understudied desert ecosystems, advancing precision in site selection and long-term project viability.
The main contributions of this study are threefold. First, it introduces a hybrid GIS–ML framework that integrates geospatial analysis with advanced machine learning algorithms (Random Forest, XGBoost, and MLP neural networks), overcoming the limitations of isolated methodologies. This synergy enhances adaptability to dynamic desert-specific challenges. Second, the study pioneers the use of ground-truth data, including solar potential, environmental, and socio-economic indicators, to empirically validate models in hyper-arid regions, a critical advancement beyond theoretical or single-source datasets. Third, it establishes interpretable and scalable decision-making through SHAP-driven model transparency, clarifying feature contributions (e.g., environmental factors and solar irradiance) while ensuring replicability for other arid ecosystems. Collectively, these innovations provide a precision-driven, context-sensitive solution for sustainable solar energy planning in understudied desert environments in Pakistan.
3. Materials and Methods
3.1. Study Area
The study focused on the Cholistan Desert in Punjab, Pakistan, as the area of interest, covering approximately 25,900 square kilometers. The desert lies between latitudes 27°42′ and 29°45′ N and longitudes 69°52′ and 75°24′ E [
19]. It supports a population exceeding 0.3 million residents and sustains around 2.0 million livestock. The land in Cholistan is utilized for various purposes, including grasslands, agriculture, built-up areas, and barren terrain. The desert stretches roughly 480 km in length, with a width that varies between 32 and 192 km.
The Cholistan Desert is divided into two regions: greater Cholistan, encompassing an area of 18,130 square kilometers, and lesser Cholistan, covering 7770 square kilometers. Situated in southwest Punjab, Pakistan, it spans three districts: Bahawalpur, Bahawalnagar, and Rahim Yar Khan [
20]. The region’s climate is classified as arid to semi-arid, characteristic of tropical desert environments, and is marked by very low annual humidity. The average yearly temperature is 28.33 °C (82.99 °F), with July being the hottest month, recording a mean temperature of 38.5 °C (101.3 °F).
The region experiences abundant sunshine throughout the year, making it highly suitable for solar photovoltaic (PV) systems. This climatic advantage presents significant potential for harnessing solar energy to meet Punjab’s increasing energy needs while addressing environmental challenges. The study highlights the strategic role of utilizing the country’s solar resources to optimize energy generation and contribute to sustainable energy solutions.
Figure 1 provides a spatial representation of the study area, offering essential geographical context for this research.
3.2. Data Collection
A detailed geospatial dataset was developed to identify optimal locations for the installation of solar photovoltaic (PV) plants. This dataset was constructed using field survey data collected from 355 location points, including both PV and non-PV sites. These points were meticulously analyzed using ArcGIS software to process and evaluate critical variables for PV site selection, considering 14 distinct features such as physical-geographical, socio-economic, and resource conditioning factors. The raster values for these features were extracted using the Extract Values to Points tool in ArcGIS 10.8, with all conditioning factor maps converted into raster format at a spatial resolution of 1 km. To ensure compatibility among datasets with differing resolutions, a multi-step procedure was implemented, resampling all datasets to a uniform 1 km resolution using the nearest neighbor method in ArcGIS 10.8. The integration of advanced spatial analytics and field-derived data ensures the dataset’s reliability for sustainable energy planning and decision-making.
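The resampling and point-extraction steps described above were carried out in ArcGIS 10.8. As a rough illustration of the underlying logic only, the following NumPy sketch resamples a toy raster by nearest neighbour and then reads raster values at point coordinates. The function names, the toy grid, and the coordinate origin are hypothetical and are not taken from the study's workflow.

```python
import numpy as np

def resample_nearest(grid, src_res, dst_res):
    """Resample a square-cell raster to a new cell size (nearest neighbour)."""
    n_rows = int(np.ceil(grid.shape[0] * src_res / dst_res))
    n_cols = int(np.ceil(grid.shape[1] * src_res / dst_res))
    # Centre of each target cell, expressed in source-cell index units.
    row_idx = ((np.arange(n_rows) + 0.5) * dst_res / src_res).astype(int)
    col_idx = ((np.arange(n_cols) + 0.5) * dst_res / src_res).astype(int)
    row_idx = np.minimum(row_idx, grid.shape[0] - 1)
    col_idx = np.minimum(col_idx, grid.shape[1] - 1)
    return grid[np.ix_(row_idx, col_idx)]

def extract_values_to_points(grid, x_min, y_max, res, xs, ys):
    """Return the raster value of the cell containing each (x, y) point.

    (x_min, y_max) is the top-left corner of the raster; rows grow southward.
    """
    cols = np.floor((np.asarray(xs) - x_min) / res).astype(int)
    rows = np.floor((y_max - np.asarray(ys)) / res).astype(int)
    return grid[rows, cols]

# Toy example (assumed values): a 4 x 4 raster with 250 m cells.
fine = np.arange(16).reshape(4, 4)
coarse = resample_nearest(fine, src_res=250, dst_res=500)
values = extract_values_to_points(coarse, 0, 1000, 500, [250, 750], [750, 250])
```

In the study itself the target cell size is 1 km and the points are the 355 surveyed locations; the sketch only shows how a uniform grid makes every conditioning factor directly comparable at each point.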
Solar irradiation data were downloaded from
https://solargis.com and accessed on 31 October 2024 with a resolution of 250 m. The Digital Elevation Model (DEM) was obtained from the USGS repository
https://www.usgs.gov/ and accessed on 18 November 2024 with a resolution of 30 m. Vegetation analysis was conducted using the Normalized Difference Vegetation Index (NDVI), derived from Sentinel-2 imagery (COPERNICUS/S2), with Near-Infrared (NIR) and red bands obtained from
http://livingatlas.arcgis.com and accessed on 20 November 2024 at a resolution of 30 m. These datasets were resampled to a coarser, uniform resolution of 1 km for consistency.
Infrastructure and socio-economic considerations were addressed using road network and population density data, which were critical for evaluating the proximity of potential sites to transportation routes and population centers. Both datasets were retrieved from
https://data.humdata.org and accessed on 11 and 3 November 2024, respectively. They had a spatial resolution of 1 km and were left unchanged because they already matched the target resolution. Wind speed data, sourced from the ERA5 Monthly Aggregates dataset (ECMWF/ERA5/MONTHLY), were computed by combining the u- and v-components of wind measured at 10 m above the surface. This dataset, with a spatial resolution of 1 km, was obtained through
https://earthengine.google.com and accessed on 31 October 2024, and also remained unchanged. Temperature data were obtained from
https://solargis.com and accessed on 31 October 2024 with a spatial resolution of 1 km. This process ensured that all datasets could be effectively integrated into the site suitability analysis, providing a coherent and consistent basis for the machine learning models. The summary of detailed indicators is shown in
Figure 2 and listed in
Table 1.
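Two of the indicators above are derived quantities: NDVI from the NIR and red bands, and wind speed from the u- and v-components. A minimal sketch of both calculations is given below, using the standard definitions NDVI = (NIR − Red)/(NIR + Red) and speed = sqrt(u² + v²). The function names and sample values are illustrative assumptions, not the study's processing code.

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index; NaN where NIR + Red is zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

def wind_speed(u10, v10):
    """Wind speed magnitude from the 10 m u- and v-components."""
    return np.hypot(u10, v10)

# Assumed sample reflectances and wind components.
ndvi_values = ndvi([0.5, 0.0], [0.1, 0.0])
speed = wind_speed(np.array([3.0]), np.array([4.0]))
```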
4. Methodology
The existing PV power stations are presumed to occupy fairly optimal geographical positions, as determined collaboratively by investment decision-makers and experts. The overall process comprises the following steps. Initially, ground-truth data were gathered via field surveys conducted across the Cholistan Desert in Punjab, Pakistan. Subsequently, paired non-PV installation points were randomly generated using the spatial buffer sampling method. A total of 14 conditioning factors, including physical geography, proximity, and solar resources, were identified through a literature review and extracted from multi-source datasets using ArcGIS 10.8. Next, the combined dataset of dependent and independent indicators was randomly split in a 70/30 ratio: 70% of the dataset was used for model building, while the remaining 30% was reserved for model validation.
Following this, three widely used machine learning techniques—Multilayer Perceptron (MLP), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) models—were employed for evaluation. Ultimately, the most robust model was selected to predict the suitability of PV installations throughout the desert. The relative variable importance diagram (SHAP) from the chosen robust model was then utilized to evaluate the marginal contribution and direction of each variable in relation to PV location selection. The methodological framework is comprehensively detailed in
Figure 3.
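The sampling and splitting steps can be sketched as follows. This is one plausible reading of spatial buffer sampling: random candidate points are kept only if they lie at least a buffer distance from every PV point. The buffer distance, the bounding box, the seed, and the function names are all assumptions made for illustration; the source does not state them.

```python
import numpy as np

def sample_non_pv_points(pv_xy, n_points, bounds, buffer_m, rng):
    """Draw random points that lie at least buffer_m from every PV point."""
    x_min, y_min, x_max, y_max = bounds
    accepted = []
    while len(accepted) < n_points:
        candidate = rng.uniform([x_min, y_min], [x_max, y_max])
        distances = np.hypot(*(pv_xy - candidate).T)
        if distances.min() >= buffer_m:
            accepted.append(candidate)
    return np.array(accepted)

def split_70_30(n_samples, rng, train_fraction=0.7):
    """Return shuffled index arrays for the training and validation sets."""
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

rng = np.random.default_rng(123)                    # assumed seed
pv_xy = rng.uniform(0, 100_000, size=(20, 2))       # toy PV coordinates in metres
non_pv_xy = sample_non_pv_points(pv_xy, 20, (0, 0, 100_000, 100_000), 5_000, rng)
train_idx, test_idx = split_70_30(355, rng)         # 355 samples, as in the study
```

A stratified split would preserve the PV/non-PV ratio in both subsets; the plain random split shown here simply follows the description in the text.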
4.1. Multilayer Perceptron Neural Network (MLP NN)
The MLP is one of the most widely applied feed-forward neural networks for modeling and predicting real-world phenomena, and it has therefore been used as a benchmark model in many fields. We implemented a Multilayer Perceptron (MLP) with three hidden layers of 256, 128, and 64 neurons, respectively. The Rectified Linear Unit (ReLU) [
32] activation function was used for the hidden layers.
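To optimize the MLP model, we employed the Adam optimizer [33], which combines the benefits of momentum and adaptive learning rates. The optimization process of the Adam algorithm is given by Equations (1)–(4) [34]:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \tag{1}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \tag{2}$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{3}$$
$$\theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{4}$$
where $m_t$ and $v_t$ are the biased first and second moment estimates, $\hat{m}_t$ and $\hat{v}_t$ are their bias-corrected counterparts, $g_t$ is the gradient of the loss with respect to the parameters $\theta$ at step $t$, $\beta_1$ and $\beta_2$ are the decay rates, $\eta$ is the learning rate, and $\epsilon$ is a small constant for numerical stability.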
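To make the Adam update concrete, the sketch below implements a single step of the standard algorithm in NumPy and applies it to a toy quadratic loss. The default hyperparameters shown are the commonly used Adam defaults and, like the toy loss and the learning rate of 0.01, are assumptions rather than the values used in this study.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameters and moments."""
    m = beta1 * m + (1 - beta1) * grad          # biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # biased second moment
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
```

On the first step the bias correction makes the update size approximately equal to the learning rate, regardless of the gradient's magnitude, which is one reason Adam is relatively insensitive to gradient scaling.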
ReLU introduces non-linearity while mitigating the vanishing gradient problem. To prevent overfitting, early stopping was employed with a validation fraction of 10% of the training data. Additionally, regularization adds stability to the model. The hyperparameters of the MLP were tuned, and the most suitable values were subsequently selected.
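The architecture described in this subsection could be configured along the following lines with scikit-learn's MLPClassifier. The source does not name the library used, so this is a sketch under that assumption. The L2 penalty value (alpha), the seed, the iteration cap, the feature standardization step, and the synthetic data are also assumptions added for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_mlp(alpha=1e-4, random_state=123):
    """MLP with 256-128-64 ReLU hidden layers, Adam, and early stopping."""
    mlp = MLPClassifier(
        hidden_layer_sizes=(256, 128, 64),
        activation="relu",
        solver="adam",
        alpha=alpha,               # L2 penalty; value assumed, not from the paper
        early_stopping=True,
        validation_fraction=0.1,   # 10% of the training data, as in the text
        max_iter=500,              # assumed iteration cap
        random_state=random_state,
    )
    # Standardizing inputs is a common companion step, assumed here.
    return make_pipeline(StandardScaler(), mlp)

# Synthetic stand-in for the 14 conditioning factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = build_mlp().fit(X, y)
proba = model.predict_proba(X)     # column 1 = probability of the PV class
```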
4.2. Random Forest
Random Forest (RF) is a robust ensemble learning approach that enhances predictive accuracy by constructing numerous decision trees and aggregating their outputs. It is particularly effective in addressing overfitting and variance, making it suitable for datasets with complex non-linear relationships, high-dimensional features, and minimal pre-processing requirements. In classification tasks, RF leverages the principle of combining multiple uncorrelated decision trees to achieve more reliable predictions.
Specifically, RF constructs regression trees by recursively splitting the data into subsets, optimizing a criterion such as the residual sum of squares at each step. Each split divides the feature space into regions represented as nodes, which are further divided until a predefined stopping condition is met, such as a minimum number of observations in terminal nodes. For regression tasks, predictions are derived by averaging the outputs across all trees, enabling the estimation of solar PV installation potential through the aggregated results.
We employed the Random Forest (RF) algorithm as an ensemble learning approach for classification tasks. In our study, we utilized 500 decision trees ($N = 500$) [35], with the class_weight parameter set to balanced to handle class imbalance in the dataset. We also fixed the random seed (random_state=123) to ensure the reproducibility of our results. During inference, class probabilities were calculated as the mean of the probabilities estimated by all decision trees in the ensemble:
$$P(y = c \mid x) = \frac{1}{N} \sum_{i=1}^{N} P_i(y = c \mid x) \tag{5}$$
where $P_i(y = c \mid x)$ is the probability of class $c$ estimated by the $i$-th tree and $N$ is the total number of decision trees in the ensemble.
The predicted class $\hat{y}$ for input $x$ is determined by selecting the class with the highest aggregated probability:
$$\hat{y} = \arg\max_{c} P(y = c \mid x) \tag{6}$$
This ensemble approach is highly effective in classification tasks, particularly in scenarios involving imbalanced datasets and complex feature spaces [36]. Equations (5) and (6) describe the process of calculating class probabilities and determining the predicted class in the Random Forest algorithm.
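The averaging and arg-max steps can be checked directly with scikit-learn's RandomForestClassifier, whose parameter names match those quoted in the text. The sketch below fits a forest with the stated settings on synthetic data (an assumption standing in for the real 14-feature dataset) and recomputes the class probabilities by hand from the individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the PV / non-PV samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0.8).astype(int)

rf = RandomForestClassifier(
    n_estimators=500,          # 500 trees, as stated in the text
    class_weight="balanced",   # reweights classes inversely to their frequency
    random_state=123,
)
rf.fit(X, y)

# Mean of the per-tree class probabilities.
tree_probs = np.stack([tree.predict_proba(X) for tree in rf.estimators_])
mean_probs = tree_probs.mean(axis=0)

# Class with the highest aggregated probability.
y_hat = rf.classes_[np.argmax(mean_probs, axis=1)]
```

The manually averaged probabilities agree with `rf.predict_proba(X)`, which confirms that the library performs soft voting over trees rather than a majority vote on hard labels.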
4.3. Extreme Gradient Boosting Model (XGBoost)
The Extreme Gradient Boosting (XGBoost) algorithm was proposed by [37]. It combines multiple weak learners to produce a strong learner, which helps to reduce overfitting and improve prediction accuracy. Moreover, XGBoost is robust to multicollinearity, so all influential features may be retained in the model even if some of them are strongly correlated with one another. XGBoost applies a second-order Taylor expansion to the loss function and adds a regularization term to it, balancing the reduction of the loss against the complexity of the model and thereby avoiding overfitting.
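We applied the Extreme Gradient Boosting (XGBoost) algorithm as a second ensemble learning method. XGBoost builds an additive ensemble of weak learners (decision trees) in a sequential manner, where each new tree minimizes the residual errors of the previously constructed ensemble. To achieve this, at boosting round $t$ the algorithm optimizes the objective function given by Equation (7):
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{7}$$
where $l$ is the logarithmic loss function, $\Omega$ is the regularization term, $\hat{y}_i^{(t-1)}$ is the prediction of the ensemble after $t-1$ rounds, and $f_t$ is the tree added at round $t$. The complete objective function over all $n$ training instances and $K$ trees is given by Equation (8):
$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{8}$$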
In our study, we used XGBoost for classification tasks. The loss function used is the logarithmic loss, given by Equation (9):
$$l(y_i, \hat{y}_i) = -\left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right] \tag{9}$$
where $l$ is the loss function, $y_i$ is the true label, and $\hat{y}_i$ is the predicted probability for the $i$-th instance.
The model is regularized by penalizing the complexity, which is controlled by the number of leaves ($T$) and their weights ($w$) in each tree. The regularization term is given by Equation (10):
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{10}$$
where $\gamma$ is a parameter that controls the complexity by penalizing the number of leaves, $w_j$ represents the weight of the $j$-th leaf, and $\lambda$ is a regularization parameter that penalizes large weights to prevent overfitting.
To address class imbalance in the dataset, we used the scale_pos_weight parameter, which assigns a higher weight to the minority class during training.
We used 500 boosting rounds with a fixed random seed for consistency. XGBoost leverages second-order gradients (Hessian information) for efficient optimization, and its implementation supports parallel processing to improve computational efficiency.
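The pieces of the XGBoost objective discussed in this subsection can be evaluated by hand on small inputs. The NumPy sketch below computes the summed logarithmic loss, the per-tree complexity penalty γT + ½λΣw², and a class-weight ratio. The ratio of negative to positive samples is the heuristic suggested in the XGBoost documentation for scale_pos_weight; whether the study used that exact value is an assumption, as are the function names and sample numbers.

```python
import numpy as np

def log_loss_sum(y_true, p_pred, eps=1e-12):
    """Summed binary logarithmic loss over all instances."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def tree_penalty(leaf_weights, gamma, lam):
    """Complexity penalty of one tree: gamma * T + 0.5 * lam * sum(w_j^2)."""
    w = np.asarray(leaf_weights, dtype=float)
    return float(gamma * w.size + 0.5 * lam * np.sum(w ** 2))

def scale_pos_weight(y_true):
    """Ratio of negative to positive samples (documented XGBoost heuristic)."""
    y = np.asarray(y_true)
    return float((y == 0).sum() / (y == 1).sum())

# Toy values (assumed): two instances, one tree with two leaves.
loss = log_loss_sum([1, 0], [0.5, 0.5])
penalty = tree_penalty([1.0, -2.0], gamma=0.5, lam=2.0)
objective = loss + penalty
ratio = scale_pos_weight([0, 0, 0, 1])
```

Raising γ or λ increases the penalty for the same tree, so the optimizer favours fewer leaves and smaller leaf weights, which is how the regularization term limits overfitting.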
4.4. Model Settings
The experiments were performed in a computational environment running Windows 10 Pro (64-bit), powered by an Intel Core i7-6820HQ CPU operating at 2.70 GHz (8 CPUs) and supported by an NVIDIA GeForce RTX 3090 GPU server to ensure efficient and high-performance computation. For coding and implementation, we utilized Visual Studio Code (VS Code) as the integrated development environment (IDE), with Python as the primary programming language.
4.5. Models Evaluation
A total of 355 samples, which include both PV and non-PV location points, were randomly divided into two groups, with 70% allocated for training and 30% for testing. For each type of machine learning model, receiver operating characteristic (ROC) curves were constructed to assess and compare the effectiveness of various data-driven models. The ROC curve illustrates the false positive rate on the x-axis versus the true positive rate on the y-axis. This curve has become a widely accepted method for gauging the overall performance of classification models [
38]. Additionally, the area under the ROC curve (AUC) serves as a quantitative indicator of a model’s quality. Equation (
11) is used for computing the ROC:
In this study, we define
P data points as representing solar PV installation locations, while
N denotes the data points for non-solar PV installation sites. True positive (TP) and true negative (TN) refer to the correctly classified solar PV installation locations and non-solar PV sites, respectively. The classification accuracy (CA) is calculated as the ratio of correctly classified samples, which is based on the 30% validation dataset. CA values are calculated by Equation (
12). A higher CA value suggests better model performance.
$$CA = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$
where $TP$ represents the number of pixels correctly identified as solar PV installation locations, $FN$ denotes the number of pixels misclassified as non-solar PV installation sites, $TN$ is the count of pixels correctly classified as non-solar PV sites, and $FP$ refers to the number of pixels incorrectly assigned to solar PV installation locations. Accuracy serves as a metric to assess the performance of binary classification tasks by evaluating the rate of correct identifications [
17].
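As a small worked example of these metrics, the sketch below computes the ROC curve, the AUC, and the classification accuracy with scikit-learn on four toy samples. The labels, the scores, and the 0.5 decision threshold are assumptions chosen for illustration and are unrelated to the study's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])             # 1 = PV site, 0 = non-PV site
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted PV probabilities

# ROC curve: false positive rate (x-axis) against true positive rate (y-axis).
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)

# Classification accuracy from the confusion matrix at an assumed 0.5 cut-off.
y_pred = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
ca = (tp + tn) / (tp + tn + fp + fn)
```

Here both the AUC and the CA equal 0.75, but the two measure different things: the AUC summarizes ranking quality across all thresholds, whereas the CA depends on the single threshold chosen.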
The SHAP (SHapley Additive exPlanations) method, as discussed by [
39,
40], is a robust technique for interpreting machine learning models, particularly tree-based models. SHAP quantifies the contribution of each feature to the model’s predictions, offering valuable insights into how different features influence the outcomes. While SHAP can evaluate the effect of features on individual predictions, it also provides a way to estimate feature importance. In this study, we focused on calculating the feature importance values rather than assessing the direct influence of individual features on the model’s output.
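SHAP builds on Shapley values from cooperative game theory. To show what is being computed, the sketch below evaluates exact Shapley values for a tiny hand-written model by enumerating all feature coalitions. It is a didactic simplification: features outside a coalition are replaced by a single baseline value, whereas the SHAP library averages over background data and uses much faster algorithms for tree models. The toy model and inputs are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

def toy_model(z):
    """Hypothetical model with one interaction term between features 0 and 2."""
    return 2 * z[0] + 3 * z[1] + z[0] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is shared equally between the two features involved. A global feature importance, as used in this study, is typically obtained by averaging the absolute values of such contributions over all samples.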
5. Results and Discussion
5.1. Evaluation of Three Models Applied
The three machine learning algorithms were assessed and compared using statistical metrics (AUC and CA); the ROC curves are shown in
Figure 4. The optimal model results are displayed in
Table 2 and
Figure 5. Overall, the Random Forest (RF) models demonstrated strong performance in modeling the locations of photovoltaic (PV) power plants, with an AUC exceeding
. In contrast, the Extreme Gradient Boosting (XGBoost) models exhibited the lowest performance scores, having an AUC value
, which was lower than that of other models. Regarding classification accuracy, the RF models also surpassed the other ML models, boasting a CA value of
. The RF prediction models achieved commendable results in forecasting PV installation locations. Thus, the RF model holds significant potential for modeling PV power plant locations in the study area.
5.2. Feature Importance Analysis
To better interpret the machine learning (ML) modeling results, a feature importance analysis was conducted across three machine learning models—Random Forest (RF), Multilayer Perceptron (MLP), and XGBoost (XGB). The analysis reveals consistent key factors influencing photovoltaic (PV) power plant location predictions, as illustrated in
Figure 6,
Figure 7 and
Figure 8. In the Random Forest (RF) model, several key features were identified as significantly influencing the suitability of PV installations. The most influential factors include Global Horizontal Irradiance (GHI) and population density, followed by other critical factors such as aspect, distance to residential areas, and elevation. These findings suggest that geographic elevation and local climate conditions play a pivotal role in determining optimal locations for PV installations. Additionally, other important features such as Normalized Difference Vegetation Index (NDVI), PM10 concentration, temperature, wind speed, and slope emerged as significant considerations. These elements underscore the importance of solar irradiance levels and geographical factors in assessing the potential of a site for PV installations. However, variables such as distance to road exhibited a less pronounced impact on PV site suitability, allowing for a more focused evaluation of the most relevant factors.
In the Multilayer Perceptron (MLP) model, Global Horizontal Irradiance (GHI) was identified as the most critical factor, underscoring the pivotal role of the solar resource in determining the suitability of photovoltaic (PV) installations. Elevation and wind speed also play significant roles, further emphasizing the impact of local geographic and atmospheric conditions on PV potential. Slope, distance to residential areas, and aspect are less influential, while population density, temperature, and distance to roads are the least impactful factors in the model’s assessment of PV suitability. This finding suggests that, while certain environmental and infrastructural factors are more influential in determining PV installation suitability, others have a relatively minor effect on the model’s predictions.
In the XGBoost (XGB) model, the most dominant feature is aspect, a geographical attribute that influences the efficiency of solar energy capture. Its prominence underscores the importance of the landscape’s orientation in maximizing solar energy yield. Global Horizontal Irradiance (GHI) and population density are the next most significant contributors to the model’s predictions. GHI reflects the direct impact of solar energy availability on PV potential, while population density serves as an indicator of accessibility and of the potential for infrastructure development in the area. The XGBoost model also draws on temperature, distance to residential areas, elevation, and wind speed, but these factors rank lower in importance, suggesting that they have a lesser impact on the model’s predictions.
Sunshine duration was expected to be of great importance, yet its importance was negligible in all three models. Several factors may explain this. First, the feature may be redundant with Global Horizontal Irradiance (GHI): because GHI directly measures the amount of solar energy available, it likely captures the essential information about solar potential and leaves sunshine duration little independent contribution to the models’ decisions. Second, the accuracy and resolution of the sunshine duration data might not be sufficient to provide meaningful insights, as inaccuracies or a lack of granularity could diminish its predictive power. Third, sunshine duration might be highly correlated with other features, such as temperature, leading the models to attribute predictive power to those features instead. Finally, Random Forest, XGBoost, and MLP are all capable of handling complex interactions and selecting the most relevant features; if sunshine duration does not add unique predictive value beyond what is already captured by other features, the models may effectively ignore it.
5.3. Spatial Modeling of Probability Maps for Solar Photovoltaic Installations
Advanced machine learning techniques such as Multilayer Perceptron (MLP), Random Forest (RF), and XGBoost are employed to predict the suitability of regions for large-scale solar photovoltaic (PV) installations, as shown in Figure 9. These models process geospatial and environmental data differently, resulting in variations in predicted suitability and emphasizing the importance of evaluating multiple models for robust assessments. In the Cholistan Desert, Punjab, Pakistan, Bahawalnagar in the eastern region has 10.50% of its area classified as “high” and “very high” probability for solar PV installations, attributed to its flat terrain and abundant solar irradiance. Bahawalpur, in the central Cholistan Desert, shows even greater potential, with 11.06% of its area classified similarly, highlighting its strategic suitability for large-scale solar energy development. Conversely, the part of the Cholistan Desert in Rahim Yar Khan exhibits a very low probability for solar PV installations, likely due to its limited road network and topographical factors that reduce its overall feasibility for solar energy projects.
5.4. Area Suitability Distributions
The classification maps derived from the three machine learning models, Random Forest (RF), Multilayer Perceptron (MLP), and XGBoost, revealed distinct suitability classes for solar photovoltaic (PV) installations in the Cholistan Desert of Punjab, Pakistan. These classes, determined by the probability of successful PV installations, are categorized as low, moderate, high, and very high suitability, as shown in Figure 9d. For the Random Forest model, the largest portion of the study area, 76.99%, was classified under the low suitability class, indicating a minimal likelihood for successful solar PV installations. This was followed by the moderate class, which accounted for 14.37% of the area, suggesting a slightly better potential for solar energy generation. The high suitability class covered 2.89% of the area, while the very high class, which is the most promising for solar PV installations, made up 5.75% of the total area.
Similarly, the Multilayer Perceptron model classified 68.25% of the area as low suitability, 12.02% as moderate, 5.10% as high, and 8.63% as very high suitability. XGBoost demonstrated a comparable distribution, with 68.72% in the low class, 16.22% as moderate, 10.61% as high, and 6.45% as very high. These classification results were instrumental in estimating the solar energy generation potential of the Cholistan Desert.
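The step from a probability map to area shares per suitability class can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the equal-interval thresholds (0.25, 0.50, 0.75) and the name `class_shares` are hypothetical, the cell values are made up, and every raster cell is assumed to cover the same area.

```python
from bisect import bisect_right

CLASSES = ("low", "moderate", "high", "very high")

def class_shares(probabilities, breaks=(0.25, 0.50, 0.75)):
    """Return the percentage of cells that fall in each suitability class.

    `breaks` are hypothetical equal-interval thresholds; the class breaks
    actually used for the maps are not stated in the text.
    """
    if not probabilities:
        raise ValueError("no cells to classify")
    counts = [0] * len(CLASSES)
    for p in probabilities:
        # bisect_right gives the index of the class whose interval contains p
        counts[bisect_right(breaks, p)] += 1
    total = len(probabilities)
    return {name: 100.0 * c / total for name, c in zip(CLASSES, counts)}

# Made-up predicted probabilities for ten equal-area cells.
cells = [0.05, 0.10, 0.20, 0.22, 0.30, 0.45, 0.60, 0.80, 0.90, 0.12]
shares = class_shares(cells)
for name, pct in shares.items():
    print(f"{name:>9}: {pct:.2f}%")
```

Because every cell is assigned to exactly one class, the four shares of a given model should sum to 100%, which is a convenient sanity check on any reported distribution.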
5.5. Model Validation and Comparison
To evaluate the generalization capability of the models, cross-validation was employed, providing a robust measure of their performance on unseen data. The Random Forest (RF) model demonstrated strong and consistent performance, achieving a mean cross-validation accuracy of 86.10% with a low standard deviation of 5.83%. This indicates that the RF model is highly reliable and generalizes well across different subsets of the data. The Multilayer Perceptron (MLP) model also performed competitively, with a mean accuracy of 85.24%, though it exhibited a slightly higher standard deviation of 6.60%, suggesting moderate variability in its predictions. In contrast, the XGBoost model achieved a mean accuracy of 79.35% with a standard deviation of 11.61%, indicating challenges in generalization and stability compared to RF and MLP.
These results highlight the superior performance of the RF model, which not only achieved the highest mean accuracy but also demonstrated the lowest variability across folds. The MLP model, while competitive in terms of mean accuracy, showed greater variability, which may be attributed to its sensitivity to hyperparameters or the complexity of the dataset. The XGBoost model, despite its lower mean accuracy and higher variability, still provides valuable insights, particularly in scenarios where interpretability and feature importance are critical. These findings underscore the importance of selecting the appropriate model based on the specific requirements of the task, balancing accuracy, stability, and computational efficiency.
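The mean and standard deviation of fold accuracies reported above can be computed as in the sketch below. It rests on several assumptions not confirmed by the text: the number of folds (5), the use of the population standard deviation, and the helper names `kfold_indices`, `cross_val_accuracy`, and `fit_threshold` are all hypothetical. The one-feature threshold classifier is only a stand-in for RF, MLP, or XGBoost, and the data are synthetic.

```python
import random
import statistics

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Return per-fold accuracies; `fit(X, y)` must return a predict function."""
    scores = []
    for fold in kfold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(X[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return scores

def fit_threshold(X, y):
    """Stand-in classifier: cut halfway between the class means of feature 0."""
    mean0 = statistics.mean(x[0] for x, label in zip(X, y) if label == 0)
    mean1 = statistics.mean(x[0] for x, label in zip(X, y) if label == 1)
    cut = (mean0 + mean1) / 2
    return lambda x: int((x[0] > cut) == (mean1 > mean0))

# Synthetic two-class data, NOT the study's samples.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0)] for _ in range(50)] + [[rng.gauss(2.0, 1.0)] for _ in range(50)]
y = [0] * 50 + [1] * 50

scores = cross_val_accuracy(fit_threshold, X, y, k=5)
mean_acc = statistics.mean(scores)
sd_acc = statistics.pstdev(scores)  # population SD across folds (assumption)
print(f"accuracy: {100 * mean_acc:.2f}% +/- {100 * sd_acc:.2f}%")
```

A small spread across folds, as reported for RF, indicates that accuracy depends little on which subset is held out, while a large spread, as reported for XGBoost, points to sensitivity to the particular split.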
5.6. Sensitivity Analysis
Sensitivity analysis was conducted on the Random Forest (RF) model to evaluate the impact of varying input features on model performance, quantified through changes in accuracy and Area Under the Curve (AUC). Each of the most influential variables was perturbed by ±10%, and the resulting changes in model performance were measured over 100 iterations. The baseline RF model achieved an accuracy of 0.723 and an AUC of 0.921, demonstrating strong predictive performance. Perturbations in features such as Global Horizontal Irradiance (GHI) and slope resulted in negligible changes to both metrics, indicating minimal influence on the model. In contrast, population density exhibited a moderate impact, with an average decrease of 0.0100 in accuracy and a slight increase of 0.0020 in AUC, underscoring its significance in the model’s decision-making process. Similarly, distance to residential areas showed a small negative impact on accuracy (−0.0111) and a negligible effect on AUC (−0.0001), while aspect demonstrated a minor reduction in accuracy (−0.0041) with a marginal improvement in AUC (0.0001). These findings highlight the model’s robustness to perturbations in certain features while identifying population density as a key variable, warranting further investigation and optimization to enhance predictive stability and performance. The results provide valuable insights into feature importance and guide future efforts in refining the model for improved reliability.
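The perturbation procedure described above can be sketched as follows. The text specifies only a ±10% perturbation over 100 iterations, so the uniform random scaling within ±10% used here is an assumption, as are the names `sensitivity`, `accuracy`, `auc`, and `toy_model`. The logistic toy model stands in for the trained RF model, and the data are synthetic.

```python
import math
import random

def accuracy(y_true, scores, threshold=0.5):
    """Share of samples whose thresholded score matches the label."""
    return sum((s >= threshold) == bool(t) for t, s in zip(y_true, scores)) / len(y_true)

def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity(predict_proba, X, y, feature, rel=0.10, iterations=100, seed=0):
    """Mean change in accuracy and AUC when one feature is scaled by up to +/- rel."""
    rng = random.Random(seed)
    base = [predict_proba(row) for row in X]
    base_acc, base_auc = accuracy(y, base), auc(y, base)
    d_acc = d_auc = 0.0
    for _ in range(iterations):
        perturbed = []
        for row in X:
            row = list(row)  # copy so the original data stay untouched
            row[feature] *= 1 + rng.uniform(-rel, rel)
            perturbed.append(predict_proba(row))
        d_acc += accuracy(y, perturbed) - base_acc
        d_auc += auc(y, perturbed) - base_auc
    return d_acc / iterations, d_auc / iterations

def toy_model(row):
    """Hypothetical stand-in for the RF model: depends on feature 0 only."""
    return 1 / (1 + math.exp(-(row[0] - 5.0)))

# Synthetic data, NOT the study's samples: feature 0 drives the label, feature 1 is noise.
rng = random.Random(42)
X = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(200)]
y = [int(row[0] > 5.0) for row in X]

d_acc0, d_auc0 = sensitivity(toy_model, X, y, feature=0)
d_acc1, d_auc1 = sensitivity(toy_model, X, y, feature=1)
print(f"feature 0: d_acc={d_acc0:+.4f}, d_auc={d_auc0:+.4f}")
print(f"feature 1: d_acc={d_acc1:+.4f}, d_auc={d_auc1:+.4f}")
```

In this toy setting, perturbing the feature the model ignores leaves both metrics unchanged, which mirrors the negligible changes reported for GHI and slope; a feature the model relies on shifts the metrics when perturbed.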
6. Conclusions
This study provides a comprehensive framework for identifying optimal sites for solar photovoltaic (PV) power plants in the Cholistan Desert, leveraging Geographic Information System (GIS) and machine learning techniques. By evaluating 14 critical parameters—including solar irradiance, terrain slope, proximity to infrastructure, and land-use constraints—we generated detailed suitability maps that prioritize high-yield zones in Bahawalnagar and Bahawalpur. The Random Forest (RF) model demonstrated superior performance, achieving an AUC of and a CA of , underscoring its effectiveness in modeling the spatial dynamics of solar energy deployment. The results reveal that approximately % of the study area falls within high- and very-high-suitability zones, contributing disproportionately to the total solar energy potential, estimated at 120,475 TWh/year for the region.
The findings of this study hold transformative implications for renewable energy developers, policymakers, grid infrastructure planners, environmental consultancies, and financial institutions. For renewable energy developers, the identification of high-suitability zones reduces site-selection costs and enhances project feasibility, enabling more efficient deployment of solar PV plants. Policymakers can leverage these insights to align incentives with national and global net-zero targets, fostering a supportive regulatory environment for renewable energy projects. Grid infrastructure planners can prioritize transmission routes to high-yield hubs, improving energy distribution efficiency and reducing transmission losses. Environmental consultancies can integrate ecological constraints into site-selection processes, ensuring that solar energy development is balanced with biodiversity conservation. Finally, financial institutions can utilize suitability maps to de-risk solar investments, fostering greater confidence in renewable energy projects and attracting more capital to the sector.
For academic researchers, this study bridges the gap between machine learning and industrial-scale energy planning, offering a replicable blueprint for optimizing solar energy deployment in resource-constrained regions. The methodology’s transferability provides a scalable framework for similar studies in other arid or high-irradiance regions, advancing global climate resilience. By integrating geospatial analysis with machine learning, this research contributes to the growing body of knowledge on sustainable energy planning, offering actionable insights for both theoretical and applied domains. Future research could explore the integration of real-time solar and wind data, enabling multi-source renewable energy planning and further enhancing the accuracy of site suitability models.
The findings of this study not only identify optimal sites for solar PV installations but also provide a scalable, data-driven approach to maximize solar energy potential, reduce costs, and support global climate resilience. By aligning these findings with Pakistan’s renewable energy targets, this research offers a replicable model for stakeholders across industries and academia, fostering collaboration to accelerate the transition to renewable energy. The integration of ecological and socio-economic considerations ensures that solar energy development is both environmentally sustainable and socially equitable, paving the way for a more resilient and sustainable energy future.
Author Contributions
H.A.A.: Conceptualization; H.A.A.: Methodology; H.A.A.: Software; H.A.A.: Validation; H.A.A., J.L., Z.L., A.S., R.A., M.H.B. and H.U.: Formal Analysis; H.A.A.: Data Curation; H.A.A.: Writing—Original Draft Preparation; H.A.A. and R.A.: Writing—Review and Editing; H.A.A. and R.A.: Visualization; J.L.: Supervision. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62325101 and Grant 62202020, and in part by the Fundamental Research Funds for the Central Universities.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Acknowledgments
We acknowledge the support of Beihang University, Beijing, China, for providing the infrastructure and resources required for this study.
Conflicts of Interest
Author Hameed Ullah was employed by The Urban Unit (USPMU) during the course of this research. The other authors declare no competing commercial or financial interests that could be perceived as influencing the study. The Urban Unit (USPMU) had no involvement in the study’s design, data collection, analysis, interpretation, manuscript preparation, or the decision to publish the findings.
References
- Raza, M.A.; Khatri, K.L.; Israr, A.; Ul Haque, M.I.; Ahmed, M.; Rafique, K.; Saand, A.S. Energy demand and production forecasting in Pakistan. Energy Strategy Rev. 2022, 39, 100788.
- Ahmad, H.; Jamil, F. Investigating power outages in Pakistan. Energy Policy 2024, 189, 114117.
- Hassan, Q.; Viktor, P.; Al-Musawi, T.J.; Ali, B.M.; Algburi, S.; Alzoubi, H.M.; Al-Jiboory, A.K.; Sameen, A.Z.; Salman, H.M.; Jaszczur, M. The renewable energy role in the global energy Transformations. Renew. Energy Focus 2024, 48, 100545.
- Panagoda, L.; Sandeepa, R.; Perera, W.; Sandunika, D.; Siriwardhana, S.; Alwis, M.; Dilka, S. Advancements in photovoltaic (PV) technology for solar energy generation. J. Res. Technol. Eng. 2023, 4, 30–72.
- Mian, S.H.; Moiduddin, K.; Alkhalefah, H.; Abidi, M.H.; Ahmed, F.; Hashmi, F.H. Mechanisms for choosing PV locations that allow for the most sustainable usage of solar energy. Sustainability 2023, 15, 3284.
- Aghaloo, K.; Ali, T.; Chiu, Y.R.; Sharifi, A. Optimal site selection for the solar-wind hybrid renewable energy systems in Bangladesh using an integrated GIS-based BWM-fuzzy logic method. Energy Convers. Manag. 2023, 283, 116899.
- Munkhbat, U.; Choi, Y. GIS-based site suitability analysis for solar power systems in Mongolia. Appl. Sci. 2021, 11, 3748.
- Al Garni, H.Z.; Awasthi, A. Solar PV power plants site selection: A review. In Advances in Renewable Energies and Power Technologies; Elsevier: Amsterdam, The Netherlands, 2018; pp. 57–75.
- Hafeznia, H.; Yousefi, H.; Astaraei, F.R. A novel framework for the potential assessment of utility-scale photovoltaic solar energy, application to eastern Iran. Energy Convers. Manag. 2017, 151, 240–258.
- Settou, B.; Settou, N.; Gouareh, A.; Negrou, B.; Mokhtara, C.; Messaoudi, D. A high-resolution geographic information system-analytical hierarchy process-based method for solar PV power plant site selection: A case study Algeria. Clean Technol. Environ. Policy 2021, 23, 219–234.
- Al-Ruzouq, R.; Shanableh, A.; Yilmaz, A.G.; Idris, A.; Mukherjee, S.; Khalil, M.A.; Gibril, M.B.A. Dam site suitability mapping and analysis using an integrated GIS and machine learning approach. Water 2019, 11, 1880.
- Ashraf, W.M.; Uddin, G.M.; Ahmad, H.A.; Jamil, M.A.; Tariq, R.; Shahzad, M.W.; Dua, V. Artificial intelligence enabled efficient power generation and emissions reduction underpinning net-zero goal from the coal-based power plants. Energy Convers. Manag. 2022, 268, 116025.
- Al-Ruzouq, R.; Abdallah, M.; Shanableh, A.; Alani, S.; Obaid, L.; Gibril, M.B.A. Waste to energy spatial suitability analysis using hybrid multi-criteria machine learning approach. Environ. Sci. Pollut. Res. 2022, 29, 2613–2628.
- Petrov, A.N.; Wessling, J.M. Utilization of machine-learning algorithms for wind turbine site suitability modeling in Iowa, USA. Wind Energy 2015, 18, 713–727.
- Yin, P.Y.; Wu, T.H.; Hsu, P.Y. Risk management of wind farm micro-siting using an enhanced genetic algorithm with simulation optimization. Renew. Energy 2017, 107, 508–521.
- Shorabeh, S.N.; Samany, N.N.; Minaei, F.; Firozjaei, H.K.; Homaee, M.; Boloorani, A.D. A decision model based on decision tree and particle swarm optimization algorithms to identify optimal locations for solar power plants construction in Iran. Renew. Energy 2022, 187, 56–67.
- Sun, Y.; Zhu, D.; Li, Y.; Wang, R.; Ma, R. Spatial modelling the location choice of large-scale solar photovoltaic power plants: Application of interpretable machine learning techniques and the national inventory. Energy Convers. Manag. 2023, 289, 117198.
- Cattani, G. Combining data envelopment analysis and Random Forest for selecting optimal locations of solar PV plants. Energy AI 2023, 11, 100222.
- Zubair, M.; Saleem, A.; Baig, M.A.; Islam, M.; Razzaq, A.; Gul, S.; Ahmad, S.; Moyo, H.P.; Hassan, S.; Rischkowsky, B.; et al. The influence of protection from grazing on Cholistan desert vegetation, Pakistan. Rangelands 2018, 40, 136–145.
- Haider, S.; Malik, S.; Nadeem, B.; Sadiq, N.; Ghaffari, A. Impact of population growth on the natural resources of Cholistan desert. PalArch’s J. Archaeol. Egypt/Egyptol. 2021, 18, 1778–1790.
- Jahangiri, M.; Ghaderi, R.; Haghani, A.; Nematollahi, O. Finding the best locations for establishment of solar-wind power stations in Middle-East using GIS: A review. Renew. Sustain. Energy Rev. 2016, 66, 38–52.
- Aly, A.; Jensen, S.S.; Pedersen, A.B. Solar power potential of Tanzania: Identifying CSP and PV hot spots through a GIS multicriteria decision making analysis. Renew. Energy 2017, 113, 159–175.
- Sirén, A.P.; Pekins, P.J.; Kilborn, J.R.; Kanter, J.J.; Sutherland, C.S. Potential influence of high-elevation wind farms on carnivore mobility. J. Wildl. Manag. 2017, 81, 1505–1512.
- Koc, A.; Turk, S.; Şahin, G. Multi-criteria of wind-solar site selection problem using a GIS-AHP-based approach with an application in Igdir Province/Turkey. Environ. Sci. Pollut. Res. 2019, 26, 32298–32310.
- Kuru, A. Solar power plant site selection modeling for sensitive ecosystems. Clean Technol. Environ. Policy 2023, 25, 2529–2544.
- Rekik, S.; El Alimi, S. Optimal wind-solar site selection using a GIS-AHP based approach: A case of Tunisia. Energy Convers. Manag. X 2023, 18, 100355.
- Hasti, F.; Mamkhezri, J.; McFerrin, R.; Pezhooli, N. Optimal solar photovoltaic site selection using geographic information system–based modeling techniques and assessing environmental and economic impacts: The case of Kurdistan. Sol. Energy 2023, 262, 111807.
- Demir, A.; Dinçer, A.E.; Yılmaz, K. A novel procedure for the AHP method for the site selection of solar PV farms. Int. J. Energy Res. 2024, 2024, 5535398.
- Joseph, J.I.; Umoren, A.M.; Markson, I. Development of optimal site selection method for large scale solar photovoltaic power plant. Math. Softw. Eng. 2016, 2, 66–75.
- Georgiou, A.; Skarlatos, D. Optimal site selection for sitting a solar park using multi-criteria decision analysis and geographical information systems. Geosci. Instrum. Methods Data Syst. 2016, 5, 321–332.
- Alhammad, A.; Sun, Q.; Tao, Y. Optimal solar plant site identification using GIS and remote sensing: Framework and case study. Energies 2022, 15, 312.
- Wang, M.X.; Qu, Y. Approximation capabilities of neural networks on unbounded domains. Neural Netw. 2022, 145, 56–67.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015.
- Li, Y.; Sun, Y.; Li, J. Heterogeneous effects of climate change and human activities on annual landscape change in coastal cities of mainland China. Ecol. Indic. 2021, 125, 107561.
- Diaconu, A.M.; Sulea, M. A review on ensemble methods for classification problems. Comput. Sci. Eng. 2018, 11, 45–52.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Arabameri, A.; Cerda, A.; Pradhan, B.; Tiefenbacher, J.; Lombardo, L.; Bui, D. A methodological comparison of head-cut based gully erosion susceptibility models: Combined use of statistical and artificial intelligence. Geomorphology 2020, 359, 107136.
- Roshan, K.; Zafar, A. Utilizing XAI technique to improve autoencoder based model for computer network anomaly detection with shapley additive explanation (SHAP). arXiv 2021, arXiv:2112.08442.
- Kim, Y.; Kim, Y. Explainable heat-related mortality with random forest and SHapley Additive exPlanations (SHAP) models. Sustain. Cities Soc. 2022, 79, 103677.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).