Article

Ensemble Learning Algorithms for Solar Radiation Prediction in Santo Domingo: Measurements and Evaluation

by Francisco A. Ramírez-Rivera * and Néstor F. Guerrero-Rodríguez
Engineering Sciences, Pontificia Universidad Católica Madre y Maestra (PUCMM), Av. Abraham Lincoln Esq. Romulo Betancourt, Santo Domingo 2748, Dominican Republic
*
Author to whom correspondence should be addressed.
Sustainability 2024, 16(18), 8015; https://doi.org/10.3390/su16188015
Submission received: 28 July 2024 / Revised: 8 September 2024 / Accepted: 11 September 2024 / Published: 13 September 2024
(This article belongs to the Section Energy Sustainability)

Abstract
Solar radiation is a fundamental parameter for solar photovoltaic (PV) technology. Reliable solar radiation prediction has become valuable for designing solar PV systems, guaranteeing their performance, operational efficiency, operational safety, grid dispatch, and financial planning. However, high-quality ground-based solar radiation measurements are scarce, especially for very short-term time horizons. Most existing studies trained machine learning (ML) models using datasets with time horizons of 1 h or 1 day, whereas very few studies reported using a dataset with a 1 min time horizon. In this study, a comprehensive evaluation of nine ensemble learning algorithms (ELAs) was performed to estimate solar radiation in Santo Domingo with a 1 min time horizon dataset, collected from a local weather station. The ensemble learning models evaluated included seven homogeneous ensembles: Random Forest (RF), Extra Tree (ET), adaptive gradient boosting (AGB), gradient boosting (GB), extreme gradient boosting (XGB), light gradient boosting (LGBM), and histogram-based gradient boosting (HGB); and two heterogeneous ensembles: voting and stacking. RF, ET, GB, and HGB were combined to develop the voting and stacking ensembles, with linear regression (LR) being adopted in the second layer of the stacking ensemble. Six technical metrics, including mean squared error (MSE), root mean squared error (RMSE), relative root mean squared error (rRMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R2), were used as criteria to determine the prediction quality of the developed ensemble algorithms. A comparison of the results indicates that the HGB algorithm offers superior prediction performance among the homogeneous ensemble learning models, while overall, the stacking ensemble provides the best accuracy, with metric values of MSE = 3218.27, RMSE = 56.73, rRMSE = 12.700, MAE = 29.87, MAPE = 10.60, and R2 = 0.964.


1. Introduction

In the last decade, power generation based on photovoltaic technology has experienced accelerated growth worldwide, and trends indicate that it will continue to increase exponentially in the coming years, motivated by several factors: (1) new environmental policies to mitigate the pollutant emissions generated during energy conversion in fossil fuel-based systems [1,2]; (2) tax incentives from local governments; (3) the technological maturity achieved. According to energy statistics, solar photovoltaic capacity additions increased annually by an average of 15% over the period 2016–2022. Under a conservative scenario, by 2028, capacity additions are estimated to be more than double the values of 2022 [3].
Due to its geographical location, the Dominican Republic presents a favorable scenario for renewable energy expansion, driven by the availability of renewable resources, notably solar energy, and local incentive policies [4,5]. Local actions are being implemented to decarbonize the energy matrix. In that sense, the government has established a regulatory framework aimed at diversifying energy generation systems to reach 25% renewable energy penetration by 2025, equivalent to roughly 2 GW based on the current power system [4,6]. Currently, photovoltaic technology leads in installed capacity compared to other renewable resource-based technologies. Accordingly, an optimistic scenario is expected in the coming years, marked by an increase in the percentage of solar PV energy integrated into the grid [4]. However, integrating the power generated by PV systems into the national grid is a complex process that faces many challenges, including the lack of accurate real-time monitoring and control systems, limited grid transportation capacity, and a fragile grid infrastructure. The development of robust predictive tools for estimating solar resources based on high-quality climatic data could help overcome these challenges.
Local climatic conditions are directly correlated with the generation capacity of alternative energies. Solar radiation is one of the fundamental parameters for solar energy technologies, as the energy conversion performance of solar systems is strongly influenced by solar radiation. The transient character of solar radiation can cause fluctuations in PV energy output and transfer instability to the electrical grid. Consequently, mitigating the variability of solar radiation in PV energy output and its propagation to the grid is essential for maintaining equilibrium and supplying high-quality electric energy [7,8]. For specific locations, prediction of the available solar resource becomes valuable for designing systems with efficient operational planning and performance, while reducing auxiliary energy storage requirements and financial costs.
Access to quality climatic data is a limiting factor for developing accurate and generalized predictive tools. Solar radiation is measured with different instruments, such as pyranometers, pyrheliometers, or weather stations, depending on ground-based applications. In developing countries, quality meteorological data are scarce, primarily due to the limited availability of measurement technologies, which are often cost prohibitive. Robust predictive tools could help solve these constraints, as the predictive tools can be extrapolated from one location to another.
Several criteria are reported by researchers to classify solar radiation prediction, considering the characteristics of the predictive tools. These can be categorized as follows: (1) physical models; (2) statistical time series; (3) new intelligent tools; (4) hybrid models. Physical models integrate various robust tools, such as numerical weather prediction (NWP) approaches based on the physical principles governing atmospheric processes, data assimilation, satellite data, and sky imaging. For short-term to long-term time horizons, physical models exhibit great prediction ability [9]. The main limitations of physical models are their high computational demands and the limited accessibility of prediction parameters [10]. Statistical tools estimate solar radiation values over a time horizon by statistically analyzing the historical evolution of prediction variables. These predictive tools include several popular techniques, such as Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA). Although they are most frequently applied to predict short-term horizons (<6 h) and have lower implementation requirements, their predictive capability decreases as the time horizon increases [11].
New intelligent tools refer to Artificial Intelligence (AI) algorithms, categorized into machine learning (ML) and Deep Learning (DL). Intelligent tools are becoming more popular, driven by the urgent need to extract productive information from the massive amounts of data generated in a wide range of processes. In recent years, the specialized scientific community has made notable efforts to validate AI algorithms for predicting solar radiation, considering climatic and geographical parameters at different locations [12]. Early studies report that AI algorithms are suitable for predicting solar radiation from short-term to long-term time horizons, with high performance [9,11]. AI algorithms are flexible to numerous types of input parameters and recognize nonlinear behavioral patterns with great ability. However, the complicated design code structure and computational time costs are limitations of AI algorithms [13]. All AI algorithms are strongly dependent on input data; consequently, quality data contribute to minimizing prediction error. Hence, exploratory analysis and the preparation of the database constitute an essential step in developing AI predictive tools. ML regression algorithms are commonly studied to evaluate solar radiation, including Artificial Neural Networks (ANNs) with single-layer and multi-layer perceptrons (MLP-NN), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), and gradient boosting (GB). On the other hand, the subcategory of DL integrates Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Long Short-Term Memory (LSTM). Hybrid predictive tools are based on the combination of multiple algorithms to enhance prediction performance; this could be a combination of physical models with new intelligent tools, or ML with DL, as well as ensemble learning (EL), which combines multiple base learner algorithms to obtain robust predictive tools. EL techniques can be sorted into homogeneous and heterogeneous ensembles.
The major difference between the homogeneous and heterogeneous techniques lies in the types of base learning algorithms operating inside the ensemble. A heterogeneous ensemble is based on different learning algorithms, while a homogeneous one uses the same type of base learner algorithm to build the ensemble. A homogeneous ensemble can be further classified as parallel or sequential, depending on the manner in which the base learners are trained and combined to perform predictions [14]. The following lines present a literature review focusing on research studies that used ensemble learning algorithms to capture solar radiation based on historical data measured in diverse geographical locations. Table 1 summarizes the most relevant information of each study.
Hassan et al. [15] explored the potential of the bagging (BG), GB, and RF ensemble algorithms to predict solar radiation components for daily and hourly time horizons in the MENA region and compared the prediction performance of the ensemble algorithms with SVM and MLP-NN. The database consists of five datasets collected from weather stations located in five countries of the MENA region during the period from 2010 to 2013. They did not report the ML techniques employed to clean the data, impute missing values, conduct exploratory data analysis (EDA), or perform feature selection. The results indicate that the SVM ML algorithm offers the best combination of stability and prediction accuracy, although it was penalized by computational costs 39 times higher than those of the ensemble algorithms. The characteristics of this study are reported in Table 1.
Benali et al. [16] compared the reliability of Smart Persistence, MLP-NN, and RF algorithms in estimating the global, beam, and diffuse solar irradiance at the site of Odeillo (France) for a prediction horizon range of 1 to 6 h. The dataset used to perform this study was based on 3 years’ worth of data, containing 10,599 observations. They did not conduct a complete preprocessing and data analysis process. MAE, RMSE, nRMSE, and nMAE were the statistical metrics used to evaluate their performance. A key finding of this study was that the RF ensemble algorithm predicts the three components of solar radiation with good accuracy.
Park et al. [17] examined the ability of the Light Gradient Boosting Machine (LGBM), a homogeneous sequential ensemble algorithm, to capture multistep-ahead global solar irradiation for two regions on Jeju Island (Republic of Korea) with a time horizon of one (1) hour. In this study, the prediction performance of LGBM was compared with three homogeneous ensembles (RF, GB, and XGB) and Deep Neural Network (DNN) algorithms, using the MBE, MAE, RMSE, and nRMSE metrics. They found that the XGB and LGBM methods showed similar performance, and the LGBM algorithm ran 17 times faster than XGB.
Lee et al. [18] proposed a comparative study to estimate Global Horizontal Irradiance (GHI) in six different cities in the USA using six machine learning predictive tools: four homogeneous ensemble learning methods (BG, BS, RF, and GRF), SVM, and GPR. The database consists of one year of meteorological data from each city. They selected the input parameters empirically based on similar studies reported in the literature, without using algorithms to determine the most relevant input parameters. They highlighted that the ensemble learning tools were particularly remarkable compared to SVM and GPR. In particular, the GRF tool presented superior metrics compared to the other ensemble learning methods.
Kumari and Toshniwal [19] developed an ensemble based on the stacking technique, combining the XGB and DNN algorithms to predict hourly GHI, using climatic data amassed from 2005 to 2014 at three different locations in India. The results were compared with RF, SVM, XGB, and DNN. They concluded that the proposed ensemble reduces the prediction error by 40% in comparison with RF and SVM.
Huang et al. [20] conducted research to evaluate the performance of twelve (12) machine learning algorithms in Ganzhou (China), with daily and monthly time horizons. The daily dataset was gathered over a period from 1980 to 2016, with a total of 13,100 data points, and 432 monthly average points were extracted from the daily dataset. RMSE, MAE, and R2 were used as statistical metrics to compare the capacity of the predictive tools. They did not describe the process of combining, cleaning, and filtering the data to assemble the database. They found that the GB regression algorithms excelled in accuracy over other predictive tools for the daily dataset, with R2 = 0.925, whereas the XGB regression showed the best predictive ability for the monthly dataset, obtaining R2 = 0.944.
Al-Ismail et al. [21] carried out a study to compare the predicting capacity of four homogeneous EL algorithms, namely adaptive boosting (AdaBoost), gradient boosting (GB), Random Forest (RF), and bagging (BG) to capture the incident solar irradiation in Bangladesh. The database used was collected from 32 weather stations distributed across different locations during the period from 1999 to 2017. This work did not report the process of assembling, cleaning, and filtering the database. Additionally, a dimensionality reduction algorithm was not used. Hence, the preparation stage was incomplete. According to the results, the GB regression algorithms excelled in predictive performance compared to the other EL algorithms.
Solano and Affonso [22] proposed several heterogeneous ensemble learning predictive tools based on voting average and voting weighted average, combining the following algorithms: RF, XGB, categorical boosting (CatBoost), and adaptive boosting (AdaBoost) to estimate solar irradiation at Salvador (Brazil) for a time horizon prediction in a range from 1 to 12 h. They used the k-means algorithm to cluster data with similar weather patterns and capture seasonality, while the dimensionality of the input parameters was reduced by the output average of the three individual algorithms applied. Their results suggest that the voting weighted average combining CatBoost and RF offered superior prediction performance compared to the individual algorithms and other ensembles, with the following average metrics: an MAE of 0.256, an RMSE of 0.377, an MAPE of 25.659%, and an R2 of 0.848.
Based on the literature review, a few studies have evaluated the predictive performance of the parallel and sequential homogeneous ensemble learning algorithms to capture solar radiation based on historical data measured at single or multiple geographic locations. The following considerations can be drawn from the literature review:
  • In general, ensemble learning exhibited superior predictive ability compared to individual ML algorithms.
  • Most of the previous studies have not completed or clearly described the ML preprocessing and data analysis stage, which are considered fundamental in the development of ML algorithms.
  • A minimal number of the reported articles provided information on the number of points in the collected climate database (Table 1). The number of points in the database is a critical aspect because the optimal training of ML algorithms is strongly dependent on the size and quality of the database. Moreover, all of the reviewed research estimated solar radiation using prediction horizons of one hour or longer.
  • No studies have applied the homogeneous ensemble algorithm identified as Histogram-based Gradient Boosting (HGB) to predict solar radiation.
  • None of the manuscripts have proposed a comparative analysis to evaluate the prediction performance of voting and stacking ensemble techniques by combining homogeneous ensembles based on sequential and parallel learning (Table 1).
This research focused on addressing the limitations found in the literature on solar radiation prediction using new intelligent algorithms. In this sense, a new tool for predicting solar radiation was developed based on ensemble learning algorithms, using a database with a time horizon of 1 min. The meteorological measurements were obtained from a weather station located at Santo Domingo de Guzmán, Dominican Republic. Firstly, a complete ML preprocessing and analysis stage was carried out to clean, impute, standardize, and select the most relevant input parameters for the development of the prediction tool. Next, the following homogeneous ensemble algorithms were evaluated: RF, ET, XGB, GB, AGB, HGB, and LGBM. Then, the homogeneous ensemble algorithms with the best performance were selected to build the heterogeneous voting and stacking algorithms. Lastly, the predictive performance of the voting and stacking algorithms was compared to determine which tool has the superior ability to capture the trend in the test data. MSE, RMSE, rRMSE, MAE, MAPE, and R2 were the statistical metrics used to evaluate the predictive performance of the ensemble algorithms. The major contributions of this study are described as follows:
  • A comparative analysis was performed to select the subset of input features that best fit the characteristics of the climate data by employing five ML algorithms to reduce the dimensionality of the database.
  • Histogram-based Gradient Boosting (HGB) was adopted for the first time to predict solar irradiance in tropical climates with a 1 min time horizon.
  • A new tool for solar radiation prediction based on a heterogeneous ensemble learning algorithm, combining homogeneous learning with the highest performance capability, was proposed.
  • The prediction performance of the nine ensemble learning algorithms was evaluated using the MSE, RMSE, rRMSE, MAE, MAPE, and R2 metrics.

2. Description of Ensemble Learning Algorithms

In recent years, ensemble learning methods have received significant attention from the scientific community, primarily due to the urgent need to improve prediction performance across a wide range of applications, including pattern recognition, natural language processing, medical diagnostics, engineering sciences, energy, environmental sciences, and climate forecasting [23].
An ensemble combines a number of trained base learners to generate a single learner with a superior generalization ability to solve problems effectively, with minimal prediction errors [12]. This superior generalization ability is achieved through various techniques that correct the weaknesses of the base learners, improving the collective response to increase the effectiveness of predictions on unseen data. These techniques involve reducing overfitting through diversity and aggregation within the ensemble, managing the trade-off between bias and variance (keeping bias fixed while reducing variance, or fixing variance while reducing bias), and mitigating noise in the data [14]. This leads to more reliable predictions and more generalized models.
Ensemble learning can be divided into two subcategories: (1) homogeneous ensembles, which use the same type of base learners, and (2) heterogeneous ensembles, which are created with different types of base learners. Homogeneous ensembles can be subdivided into sequential or parallel, commonly referred to as boosting and bootstrap aggregating (bagging), respectively. Figure 1 illustrates the general architecture of ensemble learning algorithms. As can be noted, the main difference between homogeneous sequential and parallel ensembles lies in the way the training dataset is manipulated during the training process of the base learners. In the parallel ensembles, the original training set is resampled to generate new subsets of training data.
Thus, multiple base learning algorithms of the same type are trained separately on the generated subsets. Then, the outputs of the base learners are aggregated to calculate overall prediction values (Figure 1). In the sequential ensembles, the training process of base learners is carried out iteratively, with each base learner depending on the information provided by the previous learners; as a result, base learners learn from the errors of previous iterations by increasing the importance of incorrectly predicted training instances in future iterations [24].
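As an illustration of the two architectures, the following minimal sketch (not taken from the paper) contrasts a parallel ensemble, which averages trees trained independently on bootstrap resamples, with a sequential one, which reweights training instances after each iteration; the synthetic dataset, base learner depth, and ensemble sizes are arbitrary assumptions.

```python
# Minimal sketch (not the authors' code): parallel vs. sequential homogeneous ensembles.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=2000, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

# Parallel (bagging): the same base learner is trained independently on bootstrap
# resamples of the training set, and the outputs are averaged.
parallel = BaggingRegressor(DecisionTreeRegressor(max_depth=6),
                            n_estimators=100, random_state=0)

# Sequential (boosting): each new learner is fitted after the previous ones, giving
# more weight to the instances that were predicted poorly so far.
sequential = AdaBoostRegressor(DecisionTreeRegressor(max_depth=6),
                               n_estimators=100, random_state=0)

for name, model in [("parallel (bagging)", parallel), ("sequential (boosting)", sequential)]:
    model.fit(X_tr, y_tr)
    print(f"{name}: R2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```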

2.1. Parallel Homogeneous Ensemble

2.1.1. Random Forest (RF)

RF was introduced by Breiman [25], and since then, it has become one of the most widely used ensemble learning methods for classification and regression ML problems. This is probably because RF works with efficiency and relative simplicity. RF uses randomized decision trees as base learners, where each decision tree is trained with a different training set resulting from random resampling with replacement of the original dataset. For regression ML problems, the output of an RF prediction is calculated by averaging the output predictions of each randomized decision tree. The RF ensemble has numerous advantages: it provides a ranking of the importance of variables in the process, reduces problems with overfitting, is not affected by outlier observations, can be parallelized for fast implementation, and has a small hyperparameter space. In contrast, it consumes significant computing resources when many trees and a large database are involved. A rigorous explanation of RF fundamentals can be found in [26].

2.1.2. Extremely Randomized Trees (ET)

ET, a modified version of RF presented by Geurts et al. [27], differs from RF in various aspects. Firstly, regarding the level of randomness, ET goes further, making the process completely random: it generates training sets from the original data randomly and without replacement, while also randomizing both attribute selection and threshold determination. Secondly, the computational time cost of ET is lower than that of RF. Ref. [28] reported that ET is appropriate for working with large datasets, while for small datasets, it may be prone to overfitting.
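The difference between these two parallel ensembles can be illustrated with scikit-learn's implementations; the synthetic data and hyperparameters below are illustrative assumptions, not the configuration used in this study.

```python
# Minimal sketch (assumed data and settings) comparing RF and ET side by side.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

X, y = make_regression(n_samples=4000, n_features=8, noise=12.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

# RF: bootstrap resampling with replacement; ET: whole training set (no bootstrap by
# default) plus fully randomized split thresholds.
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0).fit(X_tr, y_tr)
et = ExtraTreesRegressor(n_estimators=200, n_jobs=-1, random_state=0).fit(X_tr, y_tr)

for name, model in [("RF", rf), ("ET", et)]:
    y_hat = model.predict(X_te)
    print(f"{name}: MAE={mean_absolute_error(y_te, y_hat):.2f}, R2={r2_score(y_te, y_hat):.3f}")
    # Both ensembles expose a ranking of variable importance, as noted above for RF.
    print(f"{name} most important feature index:", int(np.argmax(model.feature_importances_)))
```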

2.2. Sequential Homogeneous Ensemble

In recent years, several ensemble methods based on sequential techniques have been proposed. The core of any sequential ensemble is the boosting algorithm; therefore, new tools modify or introduce innovations to this algorithm. In the following lines, a brief description of the main characteristics of the sequential ensembles adopted for this work is presented.

2.2.1. Adaptive Boosting (AB)

AB was developed by Freund and Schapire [29]. The algorithm was first introduced for ML classification and then for ML regression problems. AB differs from boosting in several aspects: (1) the base learner is forced to focus on the weights of incorrectly classified instances in the training set; and (2) the final prediction of AB is obtained by combining the results of all the base learners through the rule of weighted majority voting. It is widely adopted due to its fast implementation, simple structure, reduced number of hyperparameters, and good compatibility [26]. Possible disadvantages for the algorithm include its high sensitivity to noise and the deterioration of its predictive capacity with scarce data.

2.2.2. Gradient Boosting (GB)

GB was proposed by Friedman in a series of studies and functions as a general framework, using decision trees as base learners [30,31]. The GB decision tree is a very important ensemble algorithm because it is the baseline of the newer gradient boosting algorithms: XGB, LGBM, and Categorical Boosting (CatBoost). Similar to AB, it applies the sequential ensemble principle. However, GB focuses on working with the large errors resulting from the previous iterations. GB is based on a gradient descent optimization algorithm to minimize the loss function, while enhancing the prediction performance.
The learning process is iterative, as GB generates a series of base learners. The first base learner is trained with the original dataset to make predictions and produce residual errors; then, each subsequent base learner is trained on the residuals of its predecessor. A solution is reached when the process converges to the minimum error value by following the direction of the negative gradient, resulting in robust future predictions. The main components of the GB algorithm can be classified into (1) base learners, typically a decision tree algorithm; (2) the loss function; and (3) regularization. The GB decision tree is a strong ensemble algorithm with the ability to achieve high prediction performance, capture complex patterns in the data, and work better for low-dimensional data, though it may tend to overfit with noisy data [32]. A deep explanation of the statistical fundamentals of the Gradient Boosting algorithm can be found in reference [33].
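The iterative residual-fitting behavior described above can be observed with scikit-learn's GradientBoostingRegressor; the data and hyperparameters in this sketch are assumptions for illustration only.

```python
# Minimal sketch (assumed data and settings) of gradient boosting: shallow trees are
# added sequentially, each fitted to the residual errors of the current ensemble.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=3000, n_features=8, noise=15.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

gb = GradientBoostingRegressor(loss="squared_error",  # loss minimized by gradient descent
                               n_estimators=300,
                               learning_rate=0.05,    # shrinkage acts as regularization
                               max_depth=3,           # shallow trees as base learners
                               subsample=0.8,         # stochastic variant, extra regularization
                               random_state=1).fit(X_tr, y_tr)

# staged_predict exposes the iterative nature: the error shrinks as learners are added.
for i, y_hat in enumerate(gb.staged_predict(X_te), start=1):
    if i % 100 == 0:
        print(f"{i} trees: test MSE = {mean_squared_error(y_te, y_hat):.1f}")
```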

2.2.3. Extreme Gradient Boosting (XGB)

The GB decision tree algorithm was updated to optimize the tree structure and to implement regularizations into the loss function to control overfitting. XGB, similar to the RF and ET algorithms, can be parallelized, resulting in a faster learning procedure that allows for quicker exploration. It was first introduced in 2016 [34]. Since then, it has been applied to solve many prediction problems in different fields, such as finance, healthcare, e-commerce, and so on.

2.2.4. LightGBM (LGBM)

It is another efficient gradient boosting decision tree framework, proposed by Microsoft collaborators [35] to work with large datasets, based on the premise that the computational efficiency and scalability of the GB and XGB algorithms needed improvement. LGBM introduces several novel advances to GB algorithms: it uses histogram-based splitting algorithms, which bucket continuous attribute values into discrete bins; employs leaf-wise tree growth with depth limitation; and incorporates sample weighting via Gradient-Based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). According to the authors, all these improvements result in the following advantages: a faster training process, lower computational cost, the ability to work with large-scale datasets, and better accuracy.

2.2.5. Histogram-Based Gradient Boosting (HGB)

Recently, Scikit-Learn proposed a version of Histogram-based Gradient Boosting Decision Trees [36,37]. According to its developers, the algorithm is based on a modern gradient boosting implementation that is comparable to LGBM and XGB. The HGB algorithm offers several advantages over GB, making it an interesting tool for predictive modeling. It includes several available loss functions, early stopping to prevent overfitting, and support for missing values, which avoids the need for an imputer. Additionally, a faster training process is provided for datasets larger than 10,000 points.
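A minimal sketch of scikit-learn's HistGradientBoostingRegressor is given below, illustrating the native handling of missing values and early stopping mentioned above; the synthetic data and parameter values are assumptions, not those adopted later in this work.

```python
# Minimal sketch (assumed data and settings) of scikit-learn's HGB regressor, showing
# native NaN support and early stopping.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=20000, n_features=8, noise=10.0, random_state=2)
X[np.random.default_rng(2).random(X.shape) < 0.01] = np.nan  # HGB accepts missing values directly
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

hgb = HistGradientBoostingRegressor(loss="squared_error",
                                    max_iter=500,
                                    learning_rate=0.05,
                                    early_stopping=True,      # stop when the validation score stalls
                                    validation_fraction=0.1,
                                    n_iter_no_change=20,
                                    random_state=2).fit(X_tr, y_tr)

print("boosting iterations actually used:", hgb.n_iter_)
print("R2 on held-out data:", round(r2_score(y_te, hgb.predict(X_te)), 3))
```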

2.3. Heterogeneous Ensemble Learning

2.3.1. Voting

Voting is a heterogeneous ensemble learning method that aggregates the output predictions of multiple models to improve overall prediction performance. Voting is considered a meta-learner because it trains several base learners, each with the complete dataset, and then integrates their predictions using an averaging approach to obtain a final output. Voting can be classified by the manner in which the predictions are combined: (1) majority voting; (2) simple average voting, where the final prediction is calculated as the average of the prediction results of the individual base learners; and (3) weighted average voting, where the overall prediction is estimated using the weighted arithmetic mean, assigning different weights to the base learners depending on their individual performance.
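The simple average variant can be expressed with scikit-learn's VotingRegressor, as sketched below with the same four base learners adopted later in this work; the synthetic data and default hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed data, default hyperparameters) of a simple-average voting ensemble.
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, HistGradientBoostingRegressor,
                              VotingRegressor)
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=5000, n_features=8, noise=12.0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

voting = VotingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=200, random_state=3)),
                ("et", ExtraTreesRegressor(n_estimators=200, random_state=3)),
                ("gb", GradientBoostingRegressor(random_state=3)),
                ("hgb", HistGradientBoostingRegressor(random_state=3))],
    weights=None)  # None -> simple (unweighted) average of the base predictions

voting.fit(X_tr, y_tr)
print("voting R2:", round(r2_score(y_te, voting.predict(X_te)), 3))
```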

2.3.2. Stacked Generalization

It is classified as a superior heterogeneous ensemble learning method, in which aggregation techniques are used to combine multiple base learners in a two-layer structure [38]. In the first layer, several base learners are trained in parallel, each one with the same training set, and the resulting predictions of the base learners become a new output dataset. In the second layer, the output dataset from the first layer is used as input to train a second level ML algorithm, which is labeled as the meta learner. Then, the final prediction is the output of the second-level ML algorithm. Practical evidence [39] shows that training a simple ML algorithm (such as linear regression) instead of a complex model in the second layer could prevent overfitting problems.
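A minimal sketch of this two-layer structure with scikit-learn's StackingRegressor follows, using linear regression as the meta-learner; the synthetic data, base learner settings, and internal cv value are illustrative assumptions.

```python
# Minimal sketch (assumed data and settings) of two-layer stacking with a linear meta-learner.
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, HistGradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=5000, n_features=8, noise=12.0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=200, random_state=4)),
                ("et", ExtraTreesRegressor(n_estimators=200, random_state=4)),
                ("gb", GradientBoostingRegressor(random_state=4)),
                ("hgb", HistGradientBoostingRegressor(random_state=4))],
    final_estimator=LinearRegression(),  # simple second-layer model helps limit overfitting
    cv=5)                                # out-of-fold first-layer predictions feed the meta-learner

stack.fit(X_tr, y_tr)
print("stacking R2:", round(r2_score(y_te, stack.predict(X_te)), 3))
```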

3. Materials and Methods

The proposed methodology for this work consists of a computational simulation for modeling ensemble ML algorithms based on historical climate data. The Python programming language, supported by the following open-source libraries, was used to perform the simulations: NumPy (v1.26.4), Pandas (v2.2.1), Seaborn (v0.13.2), scikit-learn (v1.4.1), XGBoost (v1.7.3), and LightGBM (v4.3.0). The computational resources utilized for the implemented simulations are described in Table 2.
The proposed workflow is shown in Figure 2. As can be seen, first, a database was collected based on measurements of local meteorological parameters by means of a local weather station. Second, preprocessing and analysis were applied to the database. Then, the training process was conducted to find the appropriate values of hyperparameters. Finally, an evaluation based on technical metrics and analysis of the results was performed.

3.1. Data Collection

The study site is located in the city of Santo Domingo, the National District of the Dominican Republic, a Caribbean country. Santo Domingo is characterized by a tropical savanna climate with the following average annual values: minimum/maximum temperatures in the range of 22 °C to 28 °C, annual rainfall of 1380 mm, and relative humidity of around 85% [40]. For the city of Santo Domingo, according to the research in Ref. [41], the average values of Global Horizontal Irradiation (GHI) vary in the range of 5.2–5.6 kWh/m2/day, and the annual average daily sunshine is 8.6 h. A map of the Dominican Republic with information on the average values of GHI for the period from 1999 to 2018 is shown in Figure 3.
The weather station, model name Vantage Pro2 Plus (Davis Instruments, Hayward, CA, USA), was installed on the roof of the Faculty of Health Sciences and Engineering (FCSI) building of the Pontificia Universidad Católica Madre y Maestra (PUCMM), at a latitude of 18°27′46.59″ N, a longitude of 69°55′47.60″ W and an elevation of 50 m above sea level. The weather station is illustrated in Figure 4.
The Vantage Pro2 Plus weather station is equipped with an Integrated Sensor Suite (ISS), which converts the meteorological parameters into output electrical signals, and a console for real-time monitoring, internal operation, and data logging. The characteristics of the meteorological parameters measured by the weather station are shown in Table 3. A wide range of parameters can be measured by the Integrated Sensor Suite of the Vantage Pro2 Plus, including temperature, barometric pressure, wind direction and speed, solar radiation, UV solar radiation, rainfall levels, relative humidity, and dew point.
It can also calculate new indices based on the combination of measured parameters, such as THW, THSW, heat index, and wind chill. The minimum and maximum values of the meteorological parameters for certain periods of time can also be provided by the console. Additionally, the console incorporates internal sensors to measure the temperature, humidity, and derived parameters at the location where it is mounted. It is worth mentioning that the Integrated Sensor Suite acquired did not include the UV sensor; therefore, the UV index was not considered in this study. The console was configured to record the meteorological parameters every minute, and the data were transferred to the computer unit via a software interface (WeatherLink v6.0.5). The database was created by integrating all meteorological measurements taken from January to May 2022. The size of the database without preprocessing corresponds to 170,861 observations and 35 attributes.

3.2. Data Preprocessing and Analysis

The data preprocessing and analysis stage is fundamental to the development of robust ML algorithms. To carry out this stage, the raw database was first subjected to a careful cleaning process, removing measurements with solar irradiance values lower than 5 W/m2 (at night hours, low solar altitudes) by applying a filter to consider only the sunlight available from 7:30 a.m. to 6:30 p.m. After selecting the daily sample range, the few missing values present in the database (0.008% of the data) were replaced individually with new values using imputation algorithms. The following strategies were executed for the imputation process: the missing values in the categorical parameters were filled with the most frequent values by applying the univariate algorithm, while for the numerical parameters, the nearest neighbor algorithm was adopted to replace each missing value. As a result of the cleaning process, a new dataset was created, with a daily sample of 11 h and a size of 78,536 observations and 35 attributes (about 54.1% of the data were not used).
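A sketch of these cleaning and imputation steps is shown below using pandas and scikit-learn imputers; the column labels, toy records, and neighbour count are hypothetical and do not reproduce the authors' script.

```python
# Minimal sketch (hypothetical column labels): keep daylight records with irradiance
# >= 5 W/m2, then impute categorical gaps with the most frequent value and numerical
# gaps with a nearest-neighbour estimate.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy 1 min records standing in for the WeatherLink export.
df = pd.DataFrame(
    {"Timestamp": pd.date_range("2022-01-15 06:00", periods=8, freq="90min"),
     "Solar Rad.": [0.0, 120.0, 480.0, np.nan, 760.0, 310.0, 40.0, 0.0],
     "Temp Out": [22.1, 24.0, 26.5, 27.2, np.nan, 26.0, 24.5, 23.0],
     "Wind Dir": ["N", "N", "NNE", None, "N", "NE", "N", "N"]})

df = df.set_index("Timestamp").between_time("07:30", "18:30")   # daylight window only
df = df[(df["Solar Rad."] >= 5) | df["Solar Rad."].isna()]      # drop night/low-sun rows, keep gaps

cat_cols = df.select_dtypes(include="object").columns
num_cols = df.select_dtypes(exclude="object").columns
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])
print(df)
```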
Exploratory data analysis (EDA) was conducted to examine the characteristics of the dataset resulting from the cleaning process. In general, the wind blows from the north (N), northeast (NE), and north–northeast (NNE) directions, with an average wind speed of 2.18 m/s, an average outdoor air temperature of 26.93 °C, and a relative humidity of 76%. Solar radiation showed an average value of 436.85 W/m2, a maximum value of 1211 W/m2, and a minimum value of 5 W/m2. Most of the solar radiation values were collected when the wind direction was from the north, as can be seen in Figure 5a,b. The north and northeast wind directions were associated with the highest and lowest variability in the solar radiation values, respectively (Figure 5a). In Figure 5a, it can be noted that the median and interquartile range of solar radiation exhibited similar values in the north–northeast and northeast directions, while the northeast wind direction shows the most compact distribution.
To identify possible outlier values, the interquartile range technique was applied to all parameters in the dataset; as a result, no outlier values were found. As observed in Figure 5a,b, the wind direction (WD) and high wind direction (HD) parameters show a certain degree of variability in solar radiation. To study the propagation effects on solar radiation, the dummy ML technique was used to convert categorical data into numerical values and visualize their contribution to the objective variable in the coefficient matrix.
The distribution of solar radiation by wind directions for the range of daily sun hours is shown in Figure 6a. As can be seen, a line connects the maximum radiation values for each hour, resulting in a figure of merit for evaluating solar radiation behavior. Approximately 75.67% of the solar radiation measurements were captured when the wind was blowing from the north (N) direction, 23.75% corresponded to the north–northeast (NNE) wind direction, and only 0.09% were taken in the northeast (NE) wind direction. The solar radiation observations are distributed by hours as follows: 55% of the solar radiation values were scattered in a time range from 10:30 a.m. to 4:30 p.m. (from the 4th to the 9th hour), and 18% of the solar radiation points were captured during the first and last hours of the daily solar sample. The average values of solar radiation by daily sun hours are shown in Figure 6b. The trend in the figure indicates that the maximum average value of solar radiation was obtained from 12:30 p.m. to 1:30 p.m. (the 6th hour of the daily solar sample), with a value of 676.45 W/m2, while the minimum value was obtained at sunset (the 11th hour of the daily solar sample, 5:30 p.m.–6:30 p.m.).
In order to explore the relationships between the parameters in the database, a Pearson correlation coefficient matrix was generated and illustrated by a heatmap plot (Figure 7). The following considerations can be obtained for the heatmap:
  • The Arc.Int and Heat D-D parameters were eliminated from the dataset, as they reported constant values (zero variability).
  • In Figure 7, pairs of correlated input predictor parameters can be identified based on correlation coefficient values higher than 0.8 or lower than −0.8 (indicating collinearity): Wind Chill, Heat Index, THW Index, and Cool D-D are each separately correlated with Temp Out; Wind Run is associated with Wind Speed; EMC is related to In Hum; In Hum and In Temp are correlated, as are In Dew and Dew Pt.; and Rain is correlated with Rain Rate. Additionally, there is a strong linear correlation between many measured meteorological parameters and the high (Hi) and low (Low) values registered for each parameter. This effect could be because a small timeframe was set for updating the DAQ readings (1 min/reading); therefore, for many parameters, the registered high/low values and the measured values do not differ. As a consequence, the following input predictor parameters were deleted from the dataset to prevent the propagation of collinearity during the subset feature selection process and to avoid possible bias in the technical evaluation metrics: Wind Chill, Heat Index, THW Index, Cool D-D, Wind Speed, In EMC, In Hum, In Dew, Rain Rate, Hi Temp, Low Temp, Hi Speed, and Hi Solar Rad. (stored high and low values).
  • Wind direction (Wind Dir) and high wind direction (Hi Dir) parameters could have some influence on solar radiation, based on Figure 5a,b. Therefore, they were converted to numerical values using the dummy technique and included in the correlation matrix, labeled as WD_N, WD_NE, WD_NNE, HD_N, HD_NE, HD_NNE. In Figure 7, it can be seen that WD_NNE and HD_N influence solar radiation.
  • The solar energy parameter is computed from solar radiation, so its collinearity is structural. Therefore, solar energy was not included in the dataset used for the feature selection process.
Based on the exploration process of the heatmap, 17 parameters (features) were removed from the database, reducing the number of parameters from 37 to 20. As a result, a new refined dataset, with 78,536 observations and 20 attributes was generated. The dataset was used to determine the most relevant subset of features by implementing a comparative analysis with five feature selection methods. The methods with the best score were adopted to train the ensemble learning algorithms. For the dataset, a Pearson correlation matrix with a heatmap is shown in Figure 8. As can be seen, most of the input parameters have a weak relationship with solar radiation.
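The collinearity-based pruning described above can be sketched as follows; the |r| > 0.8 threshold follows the text, while the helper function drop_collinear, the toy data, and the column labels are hypothetical.

```python
# Minimal sketch (hypothetical helper and data) of pruning collinear predictors from a
# Pearson correlation matrix while preserving the target variable.
import numpy as np
import pandas as pd

def drop_collinear(df: pd.DataFrame, target: str, threshold: float = 0.8) -> pd.DataFrame:
    """Drop one predictor from every pair whose absolute Pearson correlation exceeds
    the threshold, without touching the target column or target-related pairs."""
    corr = df.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
    upper = upper.drop(index=target, columns=target, errors="ignore")
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy demonstration: "THW Index" mirrors "Temp Out" and should be discarded.
rng = np.random.default_rng(0)
temp = rng.normal(27, 2, 500)
demo = pd.DataFrame({"Temp Out": temp,
                     "THW Index": temp + rng.normal(0, 0.1, 500),   # collinear predictor
                     "Out Hum": rng.normal(76, 5, 500),
                     "Solar Rad.": rng.normal(437, 150, 500)})      # target variable
print(drop_collinear(demo, target="Solar Rad.").columns.tolist())
```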

3.3. Splitting the Dataset

The strategy used to stratify the dataset is crucial for evaluating the ML prediction tools and achieving excellent results. Based on the chronological characteristics of the dataset, the time series cross-validation stratified split strategy was used to divide the dataset into training and test sets, with a proportion of 80:20. As a result of the splitting process, the training set consisted of 62,829 observations, while the test set consisted of 15,707 observations; both the training and test sets have 20 attributes. This stratification strategy ensures that the training process of the predictive tools is carried out on the chronologically historical dataset, and the performance is evaluated on the future dataset. The normal strategy of shuffling and randomly stratifying the dataset is not suitable for this study because the ML predictive tools could learn from the future behavior of unseen data during the training process, thus improving the technical evaluation metrics. However, this is not a realistic scenario.
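A minimal sketch of the chronological 80:20 split is given below; the helper chronological_split is a hypothetical illustration, equivalent to scikit-learn's train_test_split with shuffling disabled.

```python
# Minimal sketch (hypothetical helper) of the chronological 80:20 split, with no shuffling.
def chronological_split(X, y, test_fraction=0.2):
    """Keep the first (1 - test_fraction) of the time-ordered records for training and
    the most recent records as the unseen test set."""
    cut = int(len(X) * (1 - test_fraction))
    return X[:cut], X[cut:], y[:cut], y[cut:]

# Equivalent one-liner with scikit-learn:
# from sklearn.model_selection import train_test_split
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
```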

3.4. Standardization of the Dataset

The standardization technique identified as robust scaler was applied to the dataset to transform the input variables to a specific scale range with a similar distribution. The robust scaler, which uses the median and interquartile range to scale the measurement of each input variable, is given by the following equation:
$$X_{RS} = \frac{X_i - X_{median}}{IQR} \tag{1}$$
where $IQR$ is the interquartile range of the input variable, $X_{median}$ is the median value of the measurements of each input variable, $X_i$ represents the measured values, and $X_{RS}$ is the new value scaled using the robust technique. The robust scaler was considered for standardizing the dataset because it is not affected by outlier observations, which could be advantageous when working with the random and chaotic characteristics of weather conditions.
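The robust scaling of Equation (1) corresponds to scikit-learn's RobustScaler, as the following sketch verifies on toy measurements (the numeric values are arbitrary).

```python
# Minimal sketch verifying that scikit-learn's RobustScaler reproduces Equation (1)
# column by column.
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[22.0, 76.0], [26.9, 85.0], [28.0, 60.0], [31.0, 90.0]])  # arbitrary measurements

scaler = RobustScaler()            # centers on the median, scales by the interquartile range
X_rs = scaler.fit_transform(X)

median = np.median(X, axis=0)
iqr = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
assert np.allclose(X_rs, (X - median) / iqr)   # matches Equation (1)
print(X_rs)
```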

3.5. Feature Selection

ML models are very sensitive to input variables, so selecting a relevant subset of input variables improves the predictive ability of the model. The dataset has a dimension of 78,536 observations × 20 attributes. Therefore, it contains many variables that may cause noise or may not propagate their effects to the objective variable. Given these characteristics, it is necessary to reduce the dimensionality of the feature space to obtain a smaller dataset without penalizing the predictive performance of the ML algorithms. Several methods are available in the literature to reduce the dimensionality of the dataset [43].
In this study, a comparative analysis was carried out using five feature subset selection methods to determine the appropriate subset of input features. For this purpose, the first step was to obtain a subset of features (input variables) generated by each of the feature selection methods. Then, each subset of features was used to train five ensemble learning algorithms on the training set, using the default values of the hyperparameters. Finally, the predictive performance of each ensemble learning algorithm was evaluated using the coefficient of determination (R2) on the test set to determine which of the five feature subsets provided the best performance, based on the R2 score. The results of the comparison for the five feature selection methods are reported in Table 4.
The following is a brief description of the feature selection methods adopted to select an appropriate subset of input features.

3.5.1. The Pearson Coefficient

This was the first method used to select a subset of relevant input features. Figure 9a shows the relationship between the input features and solar radiation, with a filter applied to consider only coefficient values higher than 0.1 and lower than −0.1. By applying these filters, a subset of eight input parameters was obtained (Table 4).

3.5.2. Recursive Feature Elimination (RFE)

The method was adopted with RF as the external ML algorithm. The main purpose of RFE is to create a subset of features by recursively eliminating the least important features. The ranking generated by the algorithm, from the most important feature to the least important feature, is shown in Figure 9b.
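A sketch of RFE with an RF regressor as the external estimator is shown below; the synthetic data, forest size, and the choice of eight retained features are illustrative assumptions.

```python
# Minimal sketch (assumed data and settings) of RFE with a Random Forest regressor as
# the external estimator, retaining the eight best-ranked features.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

X, y = make_regression(n_samples=2000, n_features=20, n_informative=8, random_state=5)

rfe = RFE(estimator=RandomForestRegressor(n_estimators=100, random_state=5),
          n_features_to_select=8,   # size of the retained subset
          step=1)                   # drop the least important feature at each pass
rfe.fit(X, y)

print("selected feature mask:", rfe.support_)
print("feature ranking (1 = kept):", rfe.ranking_)
```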

3.5.3. SelectKBest (SKBest)

It is a univariate feature selection method in the scikit-learn library that examines each feature individually and selects the features with the highest scores relative to the objective variable. The configuration and evaluation of the ensemble learning algorithms with the subset of features generated by the SelectKBest method are reported in Table 4.

3.5.4. Sequential Feature Selection (SFS)

This wrapper method, based on an iterative approach, reduces the dimensionality of the dataset by prioritizing the features with the highest evaluation metric to create a subset of features that strongly influence the objective variable. SFS has two iterative direction techniques and can be classified as forward or backward. The main difference between the forward and backward schemes is the direction of the iterative process. In the forward scheme, the selection algorithm begins without features, and in each iteration, it adds features one by one, choosing the one with the most predictive ability. In the backward scheme, the process begins with all the features of the dataset, and in each iteration, it removes features one by one until a smaller subset that enhances prediction performance is obtained. The results of both SFS techniques are reported in Table 4.
The subset of features obtained with the RFE selection method, adopting the RF regressor as the external ML algorithm, outperformed the other feature selection methods, with slightly better scores for each ensemble learning algorithm tested (Table 4).
Therefore, the subset of features selected for the training and evaluation process of the ensemble learning models includes the following eight features: {In Temp, In Density, Out Hum, Bar, THSW Index, Wind Speed, Dew Pt., Temp Out}. As a result of the feature subset selection process, a new reduced dataset with 78,536 observations × 8 attributes was generated.
The distribution curve of the selected subset of input features, standardized by the robust scaler technique (Equation (1)), is shown in Figure 10. As noted, the eight features are scaled to the same range, and their distributions are very similar, ensuring an equal contribution from each feature.

3.6. Training Process

The training procedure is a critical stage in the ML methodology, since it is required to find a set of hyperparameters that maximize the predictive performance and minimize the general expected loss of the ML algorithms. Currently, several optimization strategies can be identified in the literature to find an appropriate hyperparameter configuration for an ML algorithm trained on a dataset [44]. A necessary step in the hyperparameter tuning process consists of cross-validation (CV), which is a statistical technique to evaluate the accuracy of ML algorithms during the training process. CV iteratively partitions the dataset into training and testing portions, training the ML algorithm on some of these portions, and evaluating it on the remaining test portion.
In this work, the optimization strategy was established based on a random search of the hyperparameters, combined with cross-validation (CV). Random search is an optimization technique that implements random sampling over a predefined search space to find a set of appropriate hyperparameter values and evaluates the performance of ML algorithms using CV across the training dataset. The hyperparameter tuning process for all the ensemble learning algorithms studied was performed using the RandomizedSearchCV tool, available in the open source Scikit-Learn Python Library. The procedure for using RandomizedSearchCV can be described in the following steps: (1) define the training set clearly; (2) use ensemble learning algorithms for hyperparameter optimization; (3) create the hyperparameter space to search and find the best values; (4) apply CV on the training set; (5) set the depth of exploration in the hyperparameter space; (6) choose a metric to score the accuracy of the trained ML models. The corresponding hyperparameter values for each tuned ensemble and the computational cost for the training process can be seen in Table 5.
The hyperparameter search space was developed based on the structure of each ensemble algorithm, and the seed values for each of the hyperparameters were assigned empirically. The coefficient of determination (R2) was used during the hyperparameter tuning process as the metric to score and select the ensemble with the best overall results. For cross-validation, the k-fold cross-validation strategy was adopted, with five fixed folds (K = 5), without shuffling the training set, because the dataset corresponds to a historical series and shuffling could leak unseen or future data into the training folds. During the cross-validation process, the training dataset (80% of the data) is partitioned into five folds (portions). For each fold, the ML ensemble learning model is trained using four folds, while one fold is retained as a test set to evaluate accuracy. The training and testing sets change across each fold. The final score is obtained by calculating the average of the five folds. The depth of the exploration in the hyperparameter search space represents the number of iterations needed to explore the predefined hyperparameter space. It is a very complex parameter, and to the best of the authors' knowledge, until now, no rule has been reported to define it. Therefore, as an alternative, it can be defined empirically, considering the size of the hyperparameter space and the available computing capacity. The hyperparameter values set for each algorithm and the number of cores used for each ensemble learning model are reported in Table 5.
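The random search procedure can be sketched with scikit-learn's RandomizedSearchCV, as shown below for the HGB ensemble; the search space, number of iterations, and synthetic training data are illustrative assumptions and do not reproduce the values in Table 5.

```python
# Minimal sketch (illustrative search space and data, not the Table 5 values) of random
# hyperparameter search with non-shuffled 5-fold cross-validation and R2 scoring.
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, KFold

X_train, y_train = make_regression(n_samples=5000, n_features=8, noise=10.0, random_state=6)

param_space = {"max_iter": randint(100, 600),
               "learning_rate": uniform(0.01, 0.19),
               "max_leaf_nodes": randint(15, 63),
               "l2_regularization": uniform(0.0, 1.0)}

search = RandomizedSearchCV(
    estimator=HistGradientBoostingRegressor(random_state=6),
    param_distributions=param_space,
    n_iter=20,                             # depth of exploration, set empirically
    scoring="r2",                          # metric used to rank candidate configurations
    cv=KFold(n_splits=5, shuffle=False),   # chronological folds, no shuffling
    n_jobs=-1,
    random_state=6)

search.fit(X_train, y_train)
print("best cross-validated R2:", round(search.best_score_, 3))
print("best hyperparameters:", search.best_params_)
```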

3.7. Evaluation Metrics

The predictive efficiency in the performance of ML algorithms is quantified using statistical metrics that indicate the degree of deviation between the predicted values and the real values. In simple terms, these metrics reflect how close the predicted outcomes of the model are to the real values. In this work, six statistical metrics were used to evaluate the predictive ability of each ensemble learning algorithm trained to approximate the real values of solar radiation measurements: mean squared error (MSE, Equation (2)), root mean squared error (RMSE, Equation (3)), relative root mean squared error (rRMSE, Equation (4)), mean absolute error (MAE, Equation (5)), mean absolute percentage error (MAPE, Equation (6)), and the coefficient of determination (R2, Equation (7)). Several metrics were selected to thoroughly examine the error generated when comparing the ensemble learning prediction results with the real measured values.
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_{meas,i} - y_{Pred,i}\right)^{2} \tag{2}$$
$$RMSE = \left[\frac{1}{n}\sum_{i=1}^{n}\left(y_{meas,i} - y_{Pred,i}\right)^{2}\right]^{\frac{1}{2}} \tag{3}$$
$$rRMSE = \frac{\left[\frac{1}{n}\sum_{i=1}^{n}\left(y_{meas,i} - y_{Pred,i}\right)^{2}\right]^{\frac{1}{2}}}{\bar{y}_{meas}} \tag{4}$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_{meas,i} - y_{Pred,i}\right| \tag{5}$$
$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_{meas,i} - y_{Pred,i}}{y_{meas,i}}\right| \tag{6}$$
$$R^{2} = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^{n}\left(y_{meas,i} - y_{Pred,i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{meas,i} - \bar{y}_{meas}\right)^{2}} \tag{7}$$
where $y_{meas}$ is the measured value of solar radiation, $y_{Pred}$ is the predicted value of solar radiation, $\bar{y}_{meas}$ is the average measured value of solar radiation, $SSR$ is the residual sum of squares, calculated by summing the squares of the differences between the measured and predicted solar radiation values, $SST$ corresponds to the total sum of squares, and $n$ is the number of observations (measured values) in the dataset.
If the value of the coefficient of determination (R2) is close to 1, this indicates that the measured and predicted solar radiation values are strongly correlated, and therefore values very close to 1 are preferred. For the mean error metrics MAE, MAPE, MSE, RMSE, and rRMSE, the ideal scenario is negligible deviation between the measured values and the predicted results of the model. In general, lower values of the mean error metrics denote better prediction performance and a more accurate model.
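For reference, the six metrics of Equations (2)-(7) can be computed as sketched below; the measured and predicted arrays are toy values, and rRMSE and MAPE are expressed as percentages to match the way results are reported in this work.

```python
# Minimal sketch (toy arrays) computing the six evaluation metrics of Equations (2)-(7).
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

y_meas = np.array([420.0, 655.0, 880.0, 310.0, 95.0])   # measured solar radiation (W/m2)
y_pred = np.array([402.0, 640.0, 905.0, 330.0, 120.0])  # model predictions (W/m2)

mse = mean_squared_error(y_meas, y_pred)
rmse = np.sqrt(mse)
rrmse = 100 * rmse / y_meas.mean()                       # relative RMSE, as a percentage
mae = mean_absolute_error(y_meas, y_pred)
mape = 100 * mean_absolute_percentage_error(y_meas, y_pred)
r2 = r2_score(y_meas, y_pred)

print(f"MSE={mse:.2f} RMSE={rmse:.2f} rRMSE={rrmse:.2f}% "
      f"MAE={mae:.2f} MAPE={mape:.2f}% R2={r2:.3f}")
```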

4. Discussion and Results

A database was created by integrating the meteorological parameters measured with a time horizon of 1 min from January to May 2022, using a weather station located at latitude 18°27′46.59″ N, longitude 69°55′47.60″ W. The raw database consists of 170,861 observations and 35 attributes. The database was prepared for the training process, and a random search optimization strategy was applied to find the best hyperparameters for each of the seven homogeneous ensemble learning models. Then, the seven ensemble learning models were built using the optimal hyperparameters. In this section, the predictive performance of the homogeneous and heterogeneous ensemble learning models is evaluated, and the results are analyzed. This section is divided into three parts: first, an evaluation of the seven homogeneous ensemble learning models is conducted; second, an evaluation of the heterogeneous ensemble learning models is conducted; and finally, an examination of the generalization ability of the best ensemble learning model is performed.

4.1. Evaluation of Homogeneous Ensemble Learning

Seven homogeneous ensemble learning models were built: two parallel RF and ET, and five sequential models, AGB, GB, XGB, HGB, and LGBM. The values of the statistical metrics used to evaluate the effectiveness of the predictive performance of the ensemble learning algorithms using the test set are reported in Table 6.
A global overview of the evaluation metrics reveals that most of the homogeneous ensemble learning models built for the estimation of solar radiation work with relatively good performance. The sequential homogeneous ensemble learning models present better predictive performance compared to the parallel ensemble learning models. Examining the metrics reported in Table 6, it is clear that all ensemble models, except for AGB, have rRMSE values in the 10 to 20% range. According to the literature [45,46], they could be classified as models with good accuracy. The MSE is smaller for sequential homogeneous learning compared to parallel learning, resulting in predictions with less deviation. Except for AGB, the parallel and sequential learning models exhibit similar MAE values. The difference between them can be considered very small, which could indicate that both the parallel and sequential learning models perform with good accuracy in the central region of the dataset. The coefficient of determination (R2) ranges from 0.900 to 0.965, with the sequential learning models GB, XGB, HGB, and LGBM outperforming RF and ET in terms of goodness of fit.
The comparison between measured and predicted solar radiation for the homogeneous ensemble learning models is shown in Figure 11a–g. A careful comparison is required to identify the homogeneous ensemble learning model that offers the best balance between the evaluation metrics and the computational cost, without sacrificing performance and accuracy. In the first place, for the two parallel ensembles, ET outperforms RF in both accuracy and performance. However, ET is penalized by high time consumption during the hyperparameter tuning process (Table 5), consuming about twice the training time of RF. Thus, based on this evidence, ET is a better prediction option compared to RF when time consumption and computational resources are not a restriction.
For the five sequential ensembles, HGB shows better performance metrics compared to GB, XGB, AGB, and LGBM, while consuming less computation time. Comparing the metrics of ET and HGB, HGB outperforms ET in terms of MSE, RMSE, R2, and training time cost. In contrast, ET has slightly lower MAE and MAPE. The result of the comparison indicates that HGB provides a superior ability to capture the trend of the measured solar radiation and has the best overall metrics among the homogeneous ensemble learning models trained. The AGB sequential ensemble exhibits the poorest accuracy for predicting solar radiation, with the highest scores obtained for MSE, RMSE, and MAE, as well as the lowest R2 value. A similar performance for AGB was reported in Ref. [20].

4.2. Evaluation of Heterogeneous Ensemble Learning

Four homogeneous ensemble learning models named RF, ET, GB, and HGB, were combined to build the voting and stacking methods. Voting was configured as a simple average of the individual predictions, without assigning weights. For stacking, the configuration was as follows: the first layer consisted of the homogeneous learning models RF, ET, GB, and HGB, while linear regression was adopted as the meta-model in the second layer, receiving the predictions from the base learner in the first layer as inputs to generate new predictions. Based on the performance metrics, both voting and stacking outperformed the seven homogeneous ensembles across all applied metrics (Table 6), demonstrating superior effectiveness in predicting solar radiation, compared to HGB. However, the difference in evaluation metrics between stacking and the HGB sequential ensemble is not very pronounced. Therefore, it is necessary to assess whether the performance benefits of stacking justify the computational cost of training the models in the first layer. Considering computational cost and training time as constraints, the sequential HGB ensemble could be a better option.
The comparison of predicted versus measured solar radiation for voting and stacking is illustrated in Figure 12a,b. Overall, stacking offers superior predictive ability compared to voting.
Voting and stacking show very similar MAE and MAPE values, with slightly higher values for stacking; both therefore capture the main tendency of the dataset with comparable performance. Stacking outperforms voting in terms of MSE, RMSE, and R2 and is, overall, the most powerful predictive ensemble in terms of accuracy and data fit.

4.3. Generalization Capability

Stacking, built by combining the homogeneous ensembles RF, ET, GB, and HGB in the first layer with linear regression in the second layer, provides the best prediction performance based on the evaluation metrics (Table 6). To examine the generalization capability of stacking, samples were extracted from the test set (unseen data) of the curated dataset to create different scenarios in which the ability of the model to capture the tendency of the measured solar radiation can be appreciated. In this context, three scenarios were proposed: (1) a day with relatively good solar radiation availability, (2) a day with scarce solar radiation, and (3) a week with mixed solar radiation behavior.
The first scenario, shown in Figure 13a, demonstrates that the stacking algorithm efficiently tracks the measured solar radiation trend. In the second scenario (Figure 13b), stacking effectively predicts the fluctuations associated with a day of poor solar resource availability. Finally, in the mixed scenario (Figure 13c), stacking successfully captures the varied behavior of the solar radiation.
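This scenario construction can be reproduced with a short slicing routine over the test set. The sketch below is illustrative only: the DataFrame name df_test, the feature list, the column label "Solar Rad.", and the use of a DatetimeIndex are assumptions, while the dates correspond to those reported in Figure 13.

```python
# Sketch: slice the unseen test set by date and compare measured vs. predicted values.
import matplotlib.pyplot as plt

def plot_scenario(df_test, model, feature_cols, start, end, title):
    # df_test is assumed to have a DatetimeIndex and a "Solar Rad." column.
    window = df_test.loc[start:end]
    y_meas = window["Solar Rad."]
    y_pred = model.predict(window[feature_cols])
    plt.figure()
    plt.plot(y_meas.index, y_meas.values, label="Measured")
    plt.plot(y_meas.index, y_pred, label="Predicted (stacking)")
    plt.ylabel("Solar radiation [W/m2]")
    plt.title(title)
    plt.legend()
    plt.show()

# plot_scenario(df_test, stacking, features, "2022-05-05", "2022-05-05", "Good availability")
# plot_scenario(df_test, stacking, features, "2022-05-08", "2022-05-08", "Scarce radiation")
# plot_scenario(df_test, stacking, features, "2022-05-07", "2022-05-14", "Mixed week")
```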

5. Conclusions

This study evaluated the performance of nine ensemble learning algorithms for predicting global solar radiation in Santo Domingo, using a local climate dataset with a 1 min time horizon. MSE, RMSE, rRMSE, MAE, MAPE, and R2 were used as statistical metrics to determine the prediction effectiveness of the ensembles. The findings can be summarized as follows:
  • Solar radiation measurements were distributed as follows: approximately 75.67% were captured when the wind was blowing from the north (N) direction, 23.75% corresponded to the north–northeast (NNE) wind direction, and only 0.09% were taken from the northeast (NE) wind direction. The maximum average value of solar radiation was obtained from 12:30 p.m. to 1:30 p.m. (the sixth hour of the daily solar sample), with a value of 676.45 W/m2.
  • The Recursive Feature Elimination (RFE) method with Random Forest (RF) as the external model was the best method for selecting the subset of input features for the training process, outperforming the Pearson, univariate (SelectKBest), and Sequential Feature Selection (SFS) methods in terms of the R2 score.
  • Of the seven homogeneous ensembles evaluated, histogram-based gradient boosting (HGB) reported the best performance metrics and consumed the least computational time. In general, the sequential homogeneous ensembles provided better predictive power performance compared to the parallel ensembles.
  • The parallel ensembles, Extra Tree (ET) and Random Forest (RF), were compared. ET excelled in accuracy and performance, with metrics resulting in MSE = 3795.275, RMSE = 61.606, rRMSE = 13.792, MAE = 30.722, MAPE = 8.40, and R2 = 0.9584. However, ET consumed about twice the training time of RF. Based on this evidence, ET is a better prediction option compared to RF when time consumption and computational resources are not a restriction.
  • Overall, the stacking ensemble algorithm, built by combining Random Forest (RF), Extra Tree (ET), gradient boosting (GB), and histogram-based gradient boosting (HGB) methods in the first layer, while using linear regression in the second layer, provided superior accuracy and prediction performance, with evaluation metric values of MSE = 3218.265, RMSE = 56.730, rRMSE = 12.700, MAE = 29.872, MAPE = 10.60, and R2 = 0.9645. However, it is heavily penalized by the computational cost of the training procedures, especially in the first layer. Therefore, if computational cost is considered a critical constraint, the homogeneous ensemble histogram-based gradient boosting (HGB) could be an excellent alternative, as it offers metrics similar to stacking (MSE = 3308.874, RMSE = 57.523, rRMSE = 12.878, MAE = 30.839, MAPE = 10.7, R2 = 0.9631) and requires the lowest computational cost.
  • In general, the developed ensemble learning algorithms proved to be powerful tools for predicting global solar radiation in Santo Domingo, located in the Caribbean region, which is characterized by a tropical climate. They effectively captured the trend of solar radiation with excellent accuracy.

Author Contributions

Conceptualization, F.A.R.-R. and N.F.G.-R.; methodology, F.A.R.-R.; software, F.A.R.-R.; validation, F.A.R.-R. and N.F.G.-R.; formal analysis, F.A.R.-R. and N.F.G.-R.; investigation, F.A.R.-R. and N.F.G.-R.; resources, F.A.R.-R. and N.F.G.-R.; data curation, F.A.R.-R.; writing—original draft preparation, F.A.R.-R. and N.F.G.-R.; writing—review and editing, F.A.R.-R. and N.F.G.-R.; visualization, N.F.G.-R.; supervision, N.F.G.-R.; project administration, N.F.G.-R.; funding acquisition, F.A.R.-R. and N.F.G.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by MESCyT (Ministry of Higher Education Science and Technology) in the Dominican Republic through Fondocyt, under the projects the Design of Control Strategies to Improve Energy Quality in Grid-connected Photovoltaic Generators (2020-2021-3C3-072) and the Development of Methodologies Based on Solar–Photovoltaic Green Hydrogen to Stabilize the Electrical Grid and Reduce the Carbon Footprint for Electrical Generation (2022-3C1-168).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The climate database presented in this article is not available because it is being used in future studies. For more information, please contact Francisco A. Ramirez.

Acknowledgments

The authors would like to thank the Ministry of Higher Education, Science and Technology (MESCyT) for promoting the development of research in the Dominican Republic.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. UNFCCC; Conference of the Parties (COP). Adoption of the Paris Agreement. Proposal by the President. In Proceedings of the Paris Climate Change Conference—COP 21, Paris, France, 30 November–12 December 2015. [Google Scholar]
  2. COP28 UN Climate Change Conference—United Arab Emirates|UNFCCC. Available online: https://unfccc.int/cop28 (accessed on 9 June 2024).
  3. IEA. Renewables 2023 Analysis and Forecast to 2028; IEA: Paris, France, 2024. [Google Scholar]
  4. Comisión Nacional de Energía (CNE). Plan Energético Nacional 2022–2036; CNE: Santo Domingo, Dominican Republic, 2022. [Google Scholar]
  5. Consultoría Jurídica del Poder Ejecutivo. Ley Núm. 57-07 Sobre Incentivo Al Desarrollo de Fuentes Renovables de Energía y de Sus Regímenes Especiales. 2007. Available online: https://biblioteca.enj.org/handle/123456789/79969 (accessed on 27 July 2024).
  6. Consultoría Jurídica del Poder Ejecutivo. Ley Núm. 1-12 Que Establece La Estrategia Nacional de Desarrollo 2030. Available online: https://biblioteca.enj.org/handle/123456789/79975 (accessed on 27 July 2024).
  7. Kumar, D.S.; Yagli, G.M.; Kashyap, M.; Srinivasan, D. Solar Irradiance Resource and Forecasting: A Comprehensive Review. IET Renew. Power Gener. 2020, 14, 1641–1656. [Google Scholar] [CrossRef]
  8. Panda, S.; Dhaka, R.K.; Panda, B.; Pradhan, A.; Jena, C.; Nanda, L. A Review on Application of Machine Learning in Solar Energy Photovoltaic Generation Prediction. In Proceedings of the 2022 International Conference on Electronics and Renewable Systems (ICEARS), Tuticorin, India, 16–18 March 2022; pp. 1180–1184. [Google Scholar] [CrossRef]
  9. Krishnan, N.; Kumar, K.R.; Inda, C.S. How Solar Radiation Forecasting Impacts the Utilization of Solar Energy: A Critical Review. J. Clean. Prod. 2023, 388, 135860. [Google Scholar] [CrossRef]
  10. Voyant, C.; Notton, G.; Kalogirou, S.; Nivet, M.L.; Paoli, C.; Motte, F.; Fouilloy, A. Machine Learning Methods for Solar Radiation Forecasting: A Review. Renew. Energy 2017, 105, 569–582. [Google Scholar] [CrossRef]
  11. Guerrero, J.M.; Ponci, F.; Leligou, H.C.; Peñalvo-López, E.; Psomopoulos, C.S.; Sudharshan, K.; Naveen, C.; Vishnuram, P.; Venkata, D.; Krishna, S.; et al. Systematic Review on Impact of Different Irradiance Forecasting Techniques for Solar Energy Prediction. Energies 2022, 15, 6267. [Google Scholar] [CrossRef]
  12. Rahimi, N.; Park, S.; Choi, W.; Oh, B.; Kim, S.; Cho, Y.; Ahn, S.; Chong, C.; Kim, D.; Jin, C.; et al. A Comprehensive Review on Ensemble Solar Power Forecasting Algorithms. J. Electr. Eng. Technol. 2023, 18, 719–733. [Google Scholar] [CrossRef]
  13. Raza, M.Q.; Nadarajah, M.; Ekanayake, C. On Recent Advances in PV Output Power Forecast. Sol. Energy 2016, 136, 125–144. [Google Scholar] [CrossRef]
  14. Kunapuli, G.; Olstein, K. Ensemble Methods for Machine Learning; Olstein, K., Miller, K., Eds.; Manning Publications Co.: Shelter Island, NY, USA, 2023; ISBN 9781617297137. [Google Scholar]
  15. Hassan, M.A.; Khalil, A.; Kaseb, S.; Kassem, M.A. Exploring the Potential of Tree-Based Ensemble Methods in Solar Radiation Modeling. Appl. Energy 2017, 203, 897–916. [Google Scholar] [CrossRef]
  16. Benali, L.; Notton, G.; Fouilloy, A.; Voyant, C.; Dizene, R. Solar Radiation Forecasting Using Artificial Neural Network and Random Forest Methods: Application to Normal Beam, Horizontal Diffuse and Global Components. Renew. Energy 2019, 132, 871–884. [Google Scholar] [CrossRef]
  17. Park, J.; Moon, J.; Jung, S.; Hwang, E. Multistep-Ahead Solar Radiation Forecasting Scheme Based on the Light Gradient Boosting Machine: A Case Study of Jeju Island. Remote Sens. 2020, 12, 2271. [Google Scholar] [CrossRef]
  18. Lee, J.; Wang, W.; Harrou, F.; Sun, Y. Reliable Solar Irradiance Prediction Using Ensemble Learning-Based Models: A Comparative Study. Energy Convers. Manag. 2020, 208, 112582. [Google Scholar] [CrossRef]
  19. Kumari, P.; Toshniwal, D. Extreme Gradient Boosting and Deep Neural Network Based Ensemble Learning Approach to Forecast Hourly Solar Irradiance. J. Clean. Prod. 2021, 279, 123285. [Google Scholar] [CrossRef]
  20. Huang, L.; Kang, J.; Wan, M.; Fang, L.; Zhang, C.; Zeng, Z. Solar Radiation Prediction Using Different Machine Learning Algorithms and Implications for Extreme Climate Events. Front. Earth Sci. 2021, 9, 596860. [Google Scholar] [CrossRef]
  21. Alam, M.S.; Al-Ismail, F.S.; Hossain, M.S.; Rahman, S.M. Ensemble Machine-Learning Models for Accurate Prediction of Solar Irradiation in Bangladesh. Processes 2023, 11, 908. [Google Scholar] [CrossRef]
  22. Solano, E.S.; Affonso, C.M. Solar Irradiation Forecasting Using Ensemble Voting Based on Machine Learning Algorithms. Sustain. 2023, 15, 7943. [Google Scholar] [CrossRef]
  23. Mohammed, A.; Kora, R. A Comprehensive Review on Ensemble Deep Learning: Opportunities and Challenges. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 757–774. [Google Scholar] [CrossRef]
  24. González, S.; García, S.; Del Ser, J.; Rokach, L.; Herrera, F. A Practical Tutorial on Bagging and Boosting Based Ensembles for Machine Learning: Algorithms, Software Tools, Performance Study, Practical Perspectives and Opportunities. Inf. Fusion 2020, 64, 205–237. [Google Scholar] [CrossRef]
  25. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Liu, J.; Shen, W. A Review of Ensemble Learning Algorithms Used in Remote Sensing Applications. Appl. Sci. 2022, 12, 8654. [Google Scholar] [CrossRef]
  27. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely Randomized Trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
  28. Khan, A.A.; Chaudhari, O.; Chandra, R. A Review of Ensemble Learning and Data Augmentation Models for Class Imbalanced Problems: Combination, Implementation and Evaluation. Expert Syst. Appl. 2024, 244, 122778. [Google Scholar] [CrossRef]
  29. Freund, Y.; Schapire, R.E. Experiments with a New Boosting Algorithm. Icml 1996, 96, 148–156. [Google Scholar]
  30. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  31. Friedman, J.H. Stochastic Gradient Boosting. Comput. Stat. Data Anal. 2002, 38, 367–378. [Google Scholar] [CrossRef]
  32. Natekin, A.; Knoll, A. Gradient Boosting Machines, a Tutorial. Front. Neurorobot. 2013, 7, 63623. [Google Scholar] [CrossRef]
  33. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Tibshirani, R., Hastie, T., Eds.; Springer Series in Statistics; Springer: New York, NY, USA, 2009; ISBN 9780387848587. [Google Scholar]
  34. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  35. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154. [Google Scholar]
  36. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  37. Scikit-learn. Histogram-Based Gradient Boosting Regression Tree. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html (accessed on 27 July 2024).
  38. Wolpert, D.H. Stacked Generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
  39. Li, Y.; Chen, W. A Comparative Performance Assessment of Ensemble Learning for Credit Scoring. Mathematics 2020, 8, 1756. [Google Scholar] [CrossRef]
  40. Ruiz-Valero, L.; Arranz, B.; Faxas-Guzmán, J.; Flores-Sasso, V.; Medina-Lagrange, O.; Ferreira, J. Monitoring of a Living Wall System in Santo Domingo, Dominican Republic, as a Strategy to Reduce the Urban Heat Island. Buildings 2023, 13, 1222. [Google Scholar] [CrossRef]
  41. Pena, J.C.; Gordillo, G. Photovoltaic Energy in the Dominican Republic: Current Status, Policies, Currently Implemented Projects, and Plans for the Future. Int. J. Energy Environ. Econ 2020, 26, 270–284. [Google Scholar]
  42. The World Bank (2020)-Source: Global Solar Atlas 2.0-Solar Resource Data: Solargis. Solar Resource Maps of Dominican Republic. Available online: https://solargis.com/maps-and-gis-data/download/dominican-republic (accessed on 6 June 2024).
  43. Dhal, P.; Azad, C. A Comprehensive Survey on Feature Selection in the Various Fields of Machine Learning. Appl. Intell. 2021, 52, 4543–4581. [Google Scholar] [CrossRef]
  44. Elgeldawi, E.; Sayed, A.; Galal, A.R.; Zaki, A.M. Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79. [Google Scholar] [CrossRef]
  45. Li, M.F.; Tang, X.P.; Wu, W.; Liu, H. Bin General Models for Estimating Daily Global Solar Radiation for Different Solar Radiation Zones in Mainland China. Energy Convers. Manag. 2013, 70, 139–148. [Google Scholar] [CrossRef]
  46. Despotovic, M.; Nedic, V.; Despotovic, D.; Cvetanovic, S. Evaluation of Empirical Models for Predicting Monthly Mean Horizontal Diffuse Solar Radiation. Renew. Sustain. Energy Rev. 2016, 56, 246–260. [Google Scholar] [CrossRef]
Figure 1. Structure of homogeneous ensemble learning.
Figure 2. Sketch of the proposed flow process for predicting solar radiation using ensemble learning algorithms.
Figure 3. GHI of the Dominican Republic. Map provided by the World Bank Group—Solargis [42].
Figure 4. Weather station mounted on the roof of FCSI building.
Figure 5. Solar radiation: (a) grouped by wind direction (WD); (b) number of observations by wind direction (WD).
Figure 6. Distribution of the solar radiation: (a) observations by wind direction for daily sunlight hours; (b) average values for daily sunlight hours.
Figure 7. Pearson correlation matrix for the full database.
Figure 8. Pearson correlation matrix after removing input parameters.
Figure 9. Results of feature subset selection: (a) selected by Pearson coefficient; (b) relevance of features generated by RFE: only the most relevant were selected.
Figure 10. Standardized distribution curve of the subset features selected.
Figure 11. Comparison between measured and predicted solar radiation values for the seven homogeneous ensemble learning models: (a) RF; (b) ET; (c) XGB; (d) GB; (e) AGB; (f) HGB; (g) LGBM.
Figure 12. Evaluation of the measured vs. predicted solar radiation values, with two heterogeneous ensembles; (a) stacking; (b) voting.
Figure 13. Ability of the heterogeneous stacking ensemble to capture the tendency of solar radiation in several scenarios: (a) a day with good solar radiation (5 May 2022); (b) a day with scarce solar radiation (8 May 2022); (c) a week with mixed solar radiation behavior (7–14 May 2022).
Table 1. A review focusing on recent studies using ensemble learning to predict solar radiation.
Refs. | Location | Features Selected | Ensemble Algorithm | Test Time Horizon | Data | Periods | Metrics | Complete Preprocessing
[15] | Cairo, Ma'an, Ghardaia, Tataouine, Tan-Tan | 4 | BG, GB, RF, SVM *, MLP-NN | 1 h | 71,499 | 2010 to 2013 | MBE, R2, RMSE |
 | | 3 | BG, GB, RF, SVM *, MLP-NN | 1 day | 7906 | | |
[16] | Odeillo (France) | | SP, MLP-NN, RF * | 1 to 6 h | 10,559 | 3 years | MAE, RMSE, nRMSE, nMAE |
[17] | Jeju Island (South Korea) | 330 | LGBM, RF, GB, DNN | 1 h | 32,134 | 2011 to 2018 | MBE, RMSE, MAE, NRMSE |
[18] | California, Texas, Washington, Florida, Pennsylvania, Minnesota | 9 | BS, BG, RF, GRF *, SVM, GPR | 1 h | - | A year (TMY3) | RMSE, MAPE, R2 |
[19] | New Delhi, Jaipur, Gangtok | 8 | Stacking * (XGB + DNN) | 1 h | - | 2005 to 2014 | RMSE, MBE, R2 |
[20] | Bangladesh | 7 | GB *, AGB, RF, BG | - | 3060 | 1999 to 2017 | MAPE, RMSE, MAE, R2 |
[21] | Ganzhou | 10 | GB *, XGB *, AB, RF, SVM, ELM, DT, KNN, MLR, RBFNN, BPNN | 1 day | 13,100 | 1980 to 2016 | RMSE, MAE, R2 |
 | | | | 1 month | 432 | | |
[22] | El Salvador (Brazilian) | 9 | Voting *, XGB, RF, CatBoost, AdaBoost | 1 to 12 h | | | MAE, MAPE, RMSE, R2 |
This work | Santo Domingo | 8 | RF, ET, GB, XGB, HGB *, LGBM, Voting, Stacking * | 1 min | 78,536 | 5 months (2022) | R2, MSE, RMSE, rRMSE, MAE, MAPE |
* ML algorithm with the best prediction performance.
Table 2. Computational resources used to perform the simulation.
Model | Processor | Memory | Graphics Card | Hard Disk
Dell OptiPlex 7000 | 12th Gen Intel Core i7-12700 | 32 GB DDR4 | Intel Integrated Graphics | 1 TB PCIe NVMe
Table 3. Characteristics of the parameters measured by the weather station.
Parameters
/Features
DescriptionSpecifications
RangeAccuracy
(+/−)
1Date month/day8 s/
mon.
2Time 24 h8 s/
mon.
3Temp OutOutside (Ambient) Temperature−40 °C to 65 °C0.3 °C
4Hi. TempHigh outside temperature recorded for a certain period
5Low TempLow outside temperature recorded for a certain period
6In TempInside temperature/sensor located at the console 0 °C to 60 °C0.3 °C
7Out HumOutside relative humidity1% to 100%2% RH
8In HumInside relative humidity at the console 1% to 100%2% RH
9Dew Pt.Dew point−76 °C to 54 °C1 °C
10In DewInside dew point at the console −76 °C to 54 °C1 °C
11Wind SpeedSpeed of the outside local wind0 to 809 m/s>1 m/s
12Hi. SpeedHigh velocity of the outside wind recorded during the specified period
13Wind DirWind direction0° to 360°
14Hi. DirHigh wind direction recorded for a certain period
15Wind RunThe “amount” of wind passing through the station/time
16Wind ChillApparent temperature index calculated from wind speed and air temperature−79 °C to 57 °C1 °C
17Heat IndexAn apparent temperature index estimated by associated temperature and relative humidity to determine the perceived level of heat as it feels−40 °C to 74°C1 °C
18THW IndexUse the temperature, humidity, wind to estimate the apparent index−68 °C to 74 °C2 °C
19THSW
Index
Combine the temperature, humidity, sun exposure, wind to estimate apparent temperature index (how it feels out in the sun)
20BarBarometric pressure540 to 1100 mb1.0 mb
21RainThe amount of rainfall daily/monthly/yearlyto 6553 mm>4%
22Rain RateRainfall intensityto 2438 mm/h>5%
23Solar Rad.Solar radiation, including both the direct and diffuse components0 to 1800 W/m25% FS
24Hi. Solar Rad. High solar radiation recorded for a certain period
25Solar EnergyThe rate of solar radiation accumulated over a time
26Heat D-D Heating degree day
27Cool D-D Cooling degree days
28In HeatInside heat index, where the console is located−40 °C to 74 °C
29In EMCInside electromagnetic compatibility
30In DensityInside air density at the console installation location1 to 1.4 kg/m32% FS
31ET A measurement of the amount of water vapor returned to the air in a specific area through both evaporation and transpirationto 1999.9
mm
>5%
32Wind Sampwind speed samples in “Arc Int” amount of time
33Wind Tx RF channel for wind data
34ISS Recept%—RF reception
35Arc. Int.Archival interval in minutes
Table 4. Evaluation results of the five subset feature selection methods.
Selection Method | Subset of Features Selected | Characteristics | Ensemble Learning Algorithm | Score R2 (Test Set)
Pearson | Temp Out, Out Hum, Dew Pt., THSW Index, Bar, Rain, In Temp, In Density | ρ > 0.1 and ρ < −0.1 | GB | 0.924
 | | | AGB | 0.822
 | | | XGB | 0.958
 | | | ET | 0.958
 | | | RF | 0.954
RFE | In Temp, In Density, Out Hum, Bar, THSW Index, Wind Speed, Dew Pt., Temp Out | External ML algorithm = RF, RF = {n_estimators: 350, criterion: squared_error, max_depth: 15, max_features: sqrt} | GB | 0.930
 | | | AGB | 0.843
 | | | XGB | 0.962
 | | | ET | 0.964
 | | | RF | 0.960
SKBest | Temp Out, Out Hum, Dew Pt., Wind Speed, THSW Index, Bar, Rain, In Temp | Score function = regression, number of features to select = 8 | GB | 0.929
 | | | AGB | 0.843
 | | | XGB | 0.962
 | | | ET | 0.964
 | | | RF | 0.959
SFS-FW | Temp Out, Out Hum, Dew Pt., Wind Speed, THSW Index, In Density, WD_NNE, HD_N | External ML algorithm = LR, direction = forward, scoring = R2, cross-validation = kfold, Kfold = {folds = 5, shuffle = NO} | GB | 0.930
 | | | AGB | 0.833
 | | | XGB | 0.962
 | | | ET | 0.961
 | | | RF | 0.957
SFS-BW | Temp Out, Out Hum, Dew Pt., Wind Speed, THSW Index, Bar, In Temp, In Density | External ML algorithm = LR, direction = backward, scoring = R2, cross-validation = kfold, Kfold = {folds = 5, shuffle = NO} | GB | 0.930
 | | | AGB | 0.833
 | | | XGB | 0.962
 | | | ET | 0.962
 | | | RF | 0.957
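The RFE configuration listed for the best-performing selection method maps directly onto scikit-learn. The following sketch is illustrative, assuming a feature DataFrame X and a target vector y of solar radiation; the estimator settings mirror those in Table 4, and the choice of keeping eight features matches the size of the reported subset.

```python
# Sketch of the RFE feature-selection step with RF as the external estimator,
# mirroring the settings listed in Table 4. X (DataFrame of candidate features)
# and y (measured solar radiation) are assumed to exist.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rf_external = RandomForestRegressor(
    n_estimators=350,
    criterion="squared_error",
    max_depth=15,
    max_features="sqrt",
    n_jobs=-1,
    random_state=0,
)

# Keep the eight most relevant features, as reported for the RFE method.
selector = RFE(estimator=rf_external, n_features_to_select=8)
selector.fit(X, y)
selected_features = X.columns[selector.support_]
print(list(selected_features))
```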
Table 5. Specification of the hyperparameter tuning process for the ensemble learning algorithms.
Algorithm | Iterations (n_iter)/Cores | Appropriate Hyperparameters | Computational Cost (s)
RF | 1000/8 | n_estimators: 1160, max_features: 8, min_samples_leaf: 7, max_depth: 17, min_samples_split: 10 | 51,605.280
ET | 1000/8 | n_estimators: 630, min_samples_split: 10, min_samples_leaf: 1, max_depth: 23, max_features: 8 | 126,830.010
AGB | 250/8 | n_estimators: 100, loss: exponential, learning_rate: 0.201 | 10,877.330
GB | 1500/8 | n_estimators: 2200, min_weight_fraction_leaf: 0, min_samples_split: 250, min_samples_leaf: 40, max_leaf_nodes: 10, max_features: 8, max_depth: 18, loss: huber, learning_rate: 0.101, criterion: friedman_mse, alpha: 0.210, tol: 1 × 10−6, subsample: 0.1 | 16,221.750
XGB | 1500/8 | tree_method: hist, n_estimators: 2600, subsample: 0.9, scale_pos_weight: 0.05, reg_lambda: 0.89, reg_alpha: 0.2, min_child_weight: 10, max_depth: 5, learning_rate: 0.01, gamma: 0.05, colsample_bytree: 0.79 | 56,941.400
HGB | 1500/8 | quantile: 1, min_samples_leaf: 49, max_iter: 680, max_depth: 5, loss: absolute_error, learning_rate: 0.101, l2_regularization: 0.0 | 10,296.756
LGBM | 1500/8 | n_estimators: 2200, boosting_type: dart, subsample_freq: 4, subsample: 0.5, reg_lambda: 2.40, reg_alpha: 0.0, num_leaves: 31, min_sum_hessian_in_leaf: 19, min_data_in_leaf: 21, max_depth: 10, max_bin: 70, learning_rate: 0.1, colsample_bytree: 0.5, bagging_seed: 96, bagging_freq: 6, bagging_fraction: 0.3, objective: regression, force_row_wise: True | 126,833.010
Voting | /8 | Average of the output results of {HGB, ET, GB, RF} | 900.541
Stacking | /8 | Combining algorithms. Layer 1: {HGB, ET, GB, RF}; Layer 2: {LinearRegressor}; cross-validation: kfold {five folds without shuffle} | 2100.780
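The "Iterations (n_iter)/Cores" and "Computational Cost" columns above are consistent with a randomized hyperparameter search run in parallel. The sketch below only illustrates that mechanism, using HGB as an example: the search space, scoring choice, and cross-validation settings are assumptions for illustration and are not the ones used in the study.

```python
# Illustrative randomized search for HGB, with n_iter and n_jobs used in the same
# spirit as the "Iterations (n_iter)/Cores" column of Table 5.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Assumed search space, chosen only to demonstrate the mechanics.
param_distributions = {
    "max_iter": np.arange(100, 1000, 20),
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.05, 0.101, 0.2],
    "min_samples_leaf": np.arange(10, 60, 5),
    "l2_regularization": [0.0, 0.1, 1.0],
}

search = RandomizedSearchCV(
    HistGradientBoostingRegressor(loss="absolute_error"),
    param_distributions=param_distributions,
    n_iter=1500,   # number of sampled configurations
    n_jobs=8,      # number of cores
    scoring="r2",
    cv=5,
    random_state=0,
)
# search.fit(X_train, y_train)
# best_hgb = search.best_estimator_
```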
Table 6. Evaluation metrics for each ensemble learning model built.
Test set evaluation metrics:
Ensemble Learning Model | MSE [W2/m4] | RMSE [W/m2] | rRMSE [%] | MAE [W/m2] | MAPE [%] | R2 [-]
RF | 4243.296 | 65.141 | 14.583 | 33.745 | 9.20 | 0.9538
ET | 3795.275 | 61.606 | 13.792 | 30.722 | 8.40 | 0.9584
XGB | 3515.760 | 59.294 | 13.274 | 33.460 | 12.90 | 0.9608
AGB | 8739.339 | 93.484 | 20.929 | 70.992 | 49.11 | 0.9027
GB | 3499.137 | 59.154 | 13.243 | 31.977 | 11.8 | 0.9610
HGB | 3308.874 | 57.523 | 12.878 | 30.839 | 10.7 | 0.9631
LGBM | 3494.692 | 59.116 | 13.234 | 33.883 | 16.00 | 0.9611
Stacking | 3218.265 | 56.730 | 12.700 | 29.872 | 10.60 | 0.9645
Voting | 3346.470 | 57.849 | 12.951 | 29.220 | 10.40 | 0.9627
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
