1. Introduction
Growing global interest in reducing the environmental pollution created by heavy reliance on oil derivatives for the production of electric power has motivated governments to take significant steps toward the implementation of renewable energy. One of the most important renewable energy sources is wind, with the 2019 total world capacity of wind energy estimated to be 650 gigawatts [1] and the annual global increase in wind energy calculated at 20% [2]. This expansion has resulted in wind energy technology becoming a principal source of energy in terms of sales and technical development. In spite of these advances, this energy resource remains unreliable at high penetration levels, and increasing dependence on this technology is associated with the emergence of numerous problems for electrical system operators. Examples of these challenges are the substantial changes in wind production arising from the random behavior of wind speeds, as well as the difficulty of accurately forecasting wind production, which gives rise to many issues during the operation of the electric grid. Any decision to increase the use of wind energy hence requires careful planning along with highly reliable methods of making rational and informed decisions [3,4].
To analyze and evaluate the effects of inconsistent wind behavior on the reliability and stability of an electrical grid as well as on short-term operation and long-term planning, several researchers have applied and reported probability-based methods. For example, in [5,6] a sequential Monte Carlo simulation (SMCS) method was used for representing the probability distribution and time-series characteristics of wind speed. Another efficient method is the Markov chain Monte Carlo (MCMC) method [7,8], which is based on the dependence of the wind speed at a given point in time on the speed during the previous moment. This feature makes the method effective for preserving the chronological characteristics of wind speed. Some studies [9,10,11] have also dealt with correlations between the output levels of wind turbines installed in separate geographical areas or between those of multiple wind farms in adjacent areas. These studies led to the conclusion that the type of correlation (positive, negative, or zero) is related to several factors, including the way the turbines are arranged on the site and the method employed for connecting the turbines with one another as well as with the electrical network.
An examination of the correlation between the output levels of distant wind energy sites is not usually of interest to researchers because the relationship is often a zero or an inverse correlation. However, we believe that reconsidering this factor is very important, especially with respect to the correlations among multiple wind energy sites in different regions of the same country or in different countries, which might be interconnected in an electrical network. Conducting such studies would offer several advantages: (1) Knowing the diversity of and variations in wind energy production from different sites would be beneficial for grid operation in terms of power quantity and time of supply. (2) Prior knowledge of the amount of variation and the type of correlation, even if negative, would help network operators achieve effective management of grid operations, such as load flow and network stability. (3) Identification of the potential of wind energy in each region of a country is crucial information for investors or decision makers.
The results of such a study would be very important for countries that feature large areas and substantial regional diversity. With an area of 2.25 million square kilometers encompassing regions that exhibit varied environmental characteristics, the Kingdom of Saudi Arabia (KSA) is one such country. The KSA is also one of the largest countries in the Middle East, and most of the nearby countries rely on the KSA for resilient grid interconnections for ensuring power security and economic benefits. The Saudi government is taking rapid steps toward diversification of energy sources and is investing heavily in sustainable energy. This trend is one of the main priorities and objectives of the KSA’s Vision 2030. One of the most important of these subsidized projects is wind energy, since it is expected that wind energy capacity will reach 9.5 gigawatts by 2030 [12]. In 2018, the Renewable Energy Development Office (REDO) nominated about 50 companies to begin implementing the planned renewable projects [12], which include solar power stations with a capacity of 300 megawatts and wind farms with a capacity of 400 megawatts. These stations are to be operating and connected to the electric grid before the end of 2021. The Saudi government recently announced new renewable energy projects estimated at $50 billion, with implementation expected to be completed in 2023. Establishing such projects requires accurate technical and economic studies so that suitable construction locations can be determined.
In the past few years, numerous studies dealing with wind energy in Saudi Arabia have appeared in the literature. As reported in [13,14,15], several studies involved the analysis of statistical parameters associated with different wind farm sites and the extraction of Weibull distribution parameters for each individual site. The limitation of these studies is that their findings with respect to site productivity were dependent on the overall assessment of the available wind speed data for each site. Based on this method, the evaluation might indicate that a site is currently unsuitable for a wind project, but that site might in fact be considered a good choice for a specific period. These studies also relied on the assumption that an appropriate distribution for all sites is a Weibull distribution. Since such an assumption is neither accurate nor valid for all sites, the results could be over-approximations, according to [16,17,18]. To the best of our knowledge, no study has taken into account either wind speed data collected for different, distanced locations or the processing of those data as a single package to maintain the characteristics of the correlations among locations and thus to provide more accurate and detailed standard measures of wind speed productivity at those locations.
Addressing this point represents the core contribution of the work presented in this paper. Data mining techniques have recently been used in numerous applications because of the benefits these techniques offer with respect to developing models and making decisions.
Several studies have employed artificial intelligence techniques for renewable systems. For example, artificial neural networks are used in [19] to characterize PV modules. Data mining procedures that include support vector machines and fuzzy logic have also been applied in several studies. In [20], a new methodology combining a Gaussian-kernel support vector machine and an adaptive fuzzy inference system is developed. This methodology extracts the fuzzy rules directly from the training data to be used in the testing stage. In [21,22], EEG signals are analyzed using SVM, ANN, naïve Bayes, and decision trees for epilepsy detection. In [23], the authors used the decision tree technique to detect adverse drug reactions, and the system was optimized using a genetic algorithm. An efficient feature selection method was developed in [24] for enhancing Arabic text classification. In [25,26,27], texture classification techniques are developed based on independent component analysis and a naïve Bayes classifier.
In this study, a decision tree algorithm is used and the major contributions of this study in comparison to existing studies are as follows:
A unique and unified method for predicting wind speeds at diversified locations in the KSA is proposed. The proposed model enables the examination of deviations and correlations of wind speeds at different locations.
A model is developed that handles and examines an extensive range of data for a variety of sites. In addition, conclusions about the characteristics of these data can be drawn using the least possible number of classifications, which facilitates the understanding of the data and expedites their use. The goal is to help decision makers arrive at quick, accurate, and informed decisions.
Finally, the capability of the assessed locations can be ranked to enable system operators to ascertain in advance the monthly productivity of each site so that they can implement appropriate planning and operating actions.
2. System Design and Methodology
This section provides details about the developed prediction system, which is based on a decision tree algorithm. Numerous decision tree algorithms are currently available, including random forests, random trees, J48, and classification and regression trees (CART). A decision tree algorithm employs training data to build a tree model that is used for classification purposes. The developed classification algorithm involves three phases: data gathering, data preprocessing, and learning and classification. In the data-gathering phase, the training and test sets are collected from wind station databases. The second phase involves the preprocessing of the data, including outlier detection and elimination, missing data treatment, and averaging. In the learning and classification phase, the goal is to develop an intelligent decision mechanism. A test set is then applied to determine the accuracy of the developed model.
2.1. Data-Gathering Phase
The five locations whose wind speed data were examined in this study were carefully selected to cover all regions of the KSA [28]. One site was chosen to represent each region: center, east, west, south, and north. The selection corresponds to the operational divisions of the Saudi Arabian electrical system.
Figure 1 shows the sites where the data were collected.
Table 1 provides a statistical summary of the data collected for each site. These statistics are a collection of indices that provide meaningful information regarding the location and variability of the data. To facilitate their interpretation, brief definitions of some of the statistics are given here [29]. The most common indicator of the central tendency of a random variable is the mean, which is the average value of the data points. For the selected sites, the means are about 3 m/s to 4 m/s, with the exception of the east region, where the recorded mean is 1.9 m/s. The standard error (SE) indicates how close the mean of the sampled data is to the true population mean. An SE of 0.05 or less implies that the sample data are quite similar to those for the whole population, with a confidence level of 95%. The SE values for all sites are less than 0.05, so the data sample for each site is large enough to represent the true population. The median is another measure of central tendency, and the mode is the most frequently occurring value in the data. Standard deviation and variance denote the spread of the data distribution. Kurtosis identifies whether the tails of a given distribution contain extreme values. Skewness measures the asymmetry of the distribution and indicates whether the extreme values lie in one tail rather than the other. The minimum is the smallest value in the data set, while the maximum is the largest. The sum is the total of all wind speed values in the data set, and the count is the number of items in the data set. The results listed in Table 1 reveal noticeable differences among the statistical values associated with different sites. These discrepancies were expected due to the divergent distances between the sites and the diverse nature of the local weather.
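The indices described above can be computed along the following lines. This is a minimal sketch using only the Python standard library; the function name `describe`, the sample values, and the choice of population-form skewness and excess kurtosis are assumptions made for illustration and may differ from the estimators used to produce Table 1.

```python
import math
import statistics
from collections import Counter


def describe(speeds):
    """Return the summary statistics named in the text for one site."""
    n = len(speeds)
    mean = statistics.fmean(speeds)
    std = statistics.stdev(speeds)  # sample standard deviation
    # Central moments used for skewness and kurtosis (population form, assumed).
    m2 = sum((x - mean) ** 2 for x in speeds) / n
    m3 = sum((x - mean) ** 3 for x in speeds) / n
    m4 = sum((x - mean) ** 4 for x in speeds) / n
    return {
        "mean": mean,
        "standard_error": std / math.sqrt(n),
        "median": statistics.median(speeds),
        "mode": Counter(speeds).most_common(1)[0][0],
        "std": std,
        "variance": statistics.variance(speeds),
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis
        "skewness": m3 / m2 ** 1.5,
        "min": min(speeds),
        "max": max(speeds),
        "sum": sum(speeds),
        "count": n,
    }


# Hypothetical wind speeds in m/s; these are not values from the paper's data.
sample = [2.0, 3.0, 3.0, 4.0, 5.0, 3.0, 4.0, 8.0]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With real station data, the standard error shrinks as the count grows, which is the property the text relies on when comparing the SE with 0.05.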
The data are part of the Renewable Resource Monitoring and Mapping (RRMM) program prepared by King Abdullah City for Atomic and Renewable Energy (KACARE). KACARE monitored and recorded the wind speed data at different installed stations in the Kingdom of Saudi Arabia at 3 m height.
Table 2 provides an example of data for one of the five sites. The sample size determines the amount of information available and the precision, or level of confidence, of the desired estimate. A wind speed estimate always has an associated level of uncertainty, which depends on the underlying variability of the data as well as on the sample size: the smaller the sample, the greater the uncertainty in the estimate, and the larger the sample, the more information it provides and the lower the uncertainty. In this study, the sample size at each selected site ranges from 19,000 to 25,000 data points. This large sample size was collected to reduce the uncertainty associated with the estimate and to achieve reasonable results. The proposed model, implemented with the Weka tool, draws on several data mining concepts. First, Weka provides a preprocessing step in which outliers and irrelevant records in the raw data are detected by cleaning and clustering the data with the k-means technique. In addition, data mining techniques account for uncertainty. This is seen in the decision tree methodology, where the Gini impurity measure is applied to decide the optimal split at the root node and at subsequent splits. The Gini impurity measures how often a randomly chosen element of the dataset would be mislabeled if it were labeled at random according to the distribution of labels. Entropy is another impurity measure; with it, the optimal split is the one whose resulting partitions have the lowest entropy.
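As an illustration of how the Gini impurity ranks candidate splits, the following minimal sketch compares a split that separates two classes perfectly with one that does not. The helper names `gini_impurity` and `weighted_gini` and the class labels are hypothetical and chosen only for this example; the sketch is not Weka's implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Chance of mislabeling a random element when labels are drawn
    at random from the node's own class distribution."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted impurity of a candidate binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical class labels at one node (not from the paper's data).
node = ["Case 1", "Case 1", "Case 2", "Case 2"]
pure_split = weighted_gini(["Case 1", "Case 1"], ["Case 2", "Case 2"])
mixed_split = weighted_gini(["Case 1", "Case 2"], ["Case 1", "Case 2"])

print(gini_impurity(node), pure_split, mixed_split)
```

The split with the lower weighted impurity is preferred, so the split that isolates each class would be chosen over the one that leaves both partitions mixed.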
A subset of the combined database is shown in Table 3. The data were collected from 9 January 2013 to 31 December 2016. The subset consists of 34,872 records. The information in Table 3 is only a small subset of the available database. Zero irradiances for the north region in this table were recorded at 4 and 5 am; this is expected because the sun has not yet risen at those hours.
2.2. Data Preprocessing Phase
Data preprocessing includes data cleaning and missing data treatment. In this phase, information not needed for the wind speed model, such as the irradiance and the latitude and longitude, is removed from the database. Wind speed values missing for a specific date are then replaced by the average value of the wind speeds for that day [30,31,32,33]. Finally, the date itself is dropped and replaced by the corresponding month; e.g., 29/05/2013 10:00:00 is replaced by May, as shown in Table 4.
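The two preprocessing steps described above, filling a missing reading with the same-day average and replacing the timestamp with its month, can be sketched as follows. The record layout, the function name `preprocess`, and the decision to drop a record when no reading exists for that day are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical raw records: (timestamp, wind speed in m/s, or None if missing).
raw = [
    ("29/05/2013 09:00:00", 3.0),
    ("29/05/2013 10:00:00", None),
    ("29/05/2013 11:00:00", 5.0),
    ("30/05/2013 10:00:00", 2.0),
]


def preprocess(records, fmt="%d/%m/%Y %H:%M:%S"):
    """Fill missing speeds with the same-day mean, then keep only the month name."""
    parsed = [(datetime.strptime(stamp, fmt), speed) for stamp, speed in records]
    # Collect the observed speeds for each calendar day.
    by_day = defaultdict(list)
    for stamp, speed in parsed:
        if speed is not None:
            by_day[stamp.date()].append(speed)
    cleaned = []
    for stamp, speed in parsed:
        if speed is None:
            observed = by_day.get(stamp.date())
            if not observed:
                continue  # assumption: drop the record if the whole day is missing
            speed = sum(observed) / len(observed)
        cleaned.append((MONTHS[stamp.month - 1], speed))
    return cleaned


cleaned = preprocess(raw)
print(cleaned)
```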
The combined database is then rearranged to add an output label to a new set of input attributes. The new set of input attributes is defined as indicated in Table 5: month, center wind speed, south wind speed, east wind speed, north wind speed, and west wind speed. The output attribute consists of multi-labeled data: case 1 to case 120. Since the number of locations is five, the number of possible output cases is 5! = 120.
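One plausible reading of the 120 output cases is that each case corresponds to one ordering of the five regions by wind speed. The sketch below assumes that reading; the region speeds and the helper name `ordering` are hypothetical and serve only to show where the figure 5! = 120 comes from.

```python
from itertools import permutations

REGIONS = ("center", "south", "east", "north", "west")


def ordering(speeds):
    """Return the region names sorted from highest to lowest wind speed."""
    return tuple(sorted(speeds, key=speeds.get, reverse=True))


# Every possible ranking of the five regions: 5! of them.
all_orderings = list(permutations(REGIONS))

# Hypothetical wind speeds in m/s for a single record.
record = {"center": 3.4, "south": 3.9, "east": 1.9, "north": 4.2, "west": 3.1}
case = ordering(record)

print(len(all_orderings))
print(case)
```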
With the use of an association rule algorithm [34,35,36,37], the number of possible cases can be decreased to eight. The association rule algorithm caters for the correlation between wind speeds in different areas.
The association algorithm can be summarized in the following steps:
Step 1: Generate all association rules in the form if {A,B,C,D,...} then {E,F,G,...}, where A, B, C, D, E, F, G,... are items.
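Step 2: Calculate the confidence of each generated rule, i.e., if A then B, using:
$$\text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)}$$
Step 3: Calculate the support of each generated rule, i.e., if A then B, using:
$$\text{support}(A \Rightarrow B) = \frac{\text{number of records containing both } A \text{ and } B}{\text{total number of records}}$$
Step 4: Check if support is less than a pre-defined threshold, i.e., minsup.
Step 5: Check if confidence is less than a pre-defined threshold, i.e., minconf.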
Step 6: Prune rules that fail the minsup and minconf thresholds.
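The support and confidence checks in the steps above can be sketched as follows, using the standard definitions of the two measures. The item names such as "east=VL", the toy records, and the function names are hypothetical; the default thresholds are the minsup and minconf values reported for this study.

```python
def support(transactions, itemset):
    """Fraction of records that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= record for record in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(A and B) / support(A) for the rule 'if A then B'."""
    support_a = support(transactions, antecedent)
    if support_a == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / support_a


def keep_rule(transactions, antecedent, consequent, minsup=0.01, minconf=0.5):
    """A rule survives pruning only if it meets both thresholds."""
    rule_support = support(transactions, set(antecedent) | set(consequent))
    rule_confidence = confidence(transactions, antecedent, consequent)
    return rule_support >= minsup and rule_confidence >= minconf


# Hypothetical labeled records, one set of items per record.
transactions = [
    frozenset({"east=VL", "north=VH"}),
    frozenset({"east=VL", "north=VH"}),
    frozenset({"east=VL", "north=H"}),
    frozenset({"east=L", "north=VH"}),
]

print(support(transactions, {"east=VL", "north=VH"}))
print(confidence(transactions, {"east=VL"}, {"north=VH"}))
print(keep_rule(transactions, {"east=VL"}, {"north=VH"}))
print(keep_rule(transactions, {"east=L"}, {"north=H"}))
```

In this toy data the rule from "east=VL" to "north=VH" passes both thresholds, while the rule from "east=L" to "north=H" never occurs and is pruned.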
The wind speed of each location is labeled using a rank-based system. The developed ranking system distributes wind speeds evenly, measuring them only relative to a given location, but not according to the real value of any given speed. The developed ranking-based system includes five labels that identify the level of the wind speed: very high (VH), high (H), medium (M), low (L), and very low (VL). The database resulting after the labels have been assigned based on the wind speed ranking is shown in
Table 6.
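A rank-based labeling of this kind could be sketched as below. The sketch assumes that "evenly" means five equal-sized bins formed from the rank of each reading within its own site; the exact binning rule used for Table 6 is not stated, so this is an illustration rather than the authors' procedure. The function name and the sample speeds are hypothetical.

```python
LABELS = ("VL", "L", "M", "H", "VH")


def rank_labels(speeds):
    """Label each reading by its rank within its own site (equal-sized bins)."""
    n = len(speeds)
    # Indices of the readings ordered from slowest to fastest.
    order = sorted(range(n), key=lambda i: speeds[i])
    labels = [None] * n
    for rank, index in enumerate(order):
        labels[index] = LABELS[rank * len(LABELS) // n]
    return labels


# Hypothetical wind speeds in m/s for one site.
site_speeds = [1.0, 5.0, 2.0, 4.0, 3.0, 6.0, 0.5, 2.5, 3.5, 4.5]
site_labels = rank_labels(site_speeds)
print(site_labels)
```

Because the labels depend only on rank, a speed that counts as "VH" at a calm site could fall under "M" at a windier one, which matches the statement that speeds are measured relative to a given location rather than by their real value.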
To minimize the number of output attributes, an association rule algorithm is applied for analyzing all of the relations between the cases.
Table 7 shows the resulting cases and the corresponding locations of the rules that produce support and confidence levels greater than a given minimal support threshold (minsup = 0.01) and a given minimal confidence threshold (minconf = 0.5).
Table 8 provides a sample of association rules with their support and confidence levels. The table shows the minimum number of cases that can be achieved using the association algorithm with a unity confidence level.
Figure 2 summarizes all of the steps described above for the data-gathering and preprocessing stages.
2.3. Learning and Classification Phase
Figure 3 displays a flowchart of the developed classification algorithm, which governs the processing of the data through three stages: training, testing, and validation. First, the training data are applied to the decision tree algorithm to obtain the initial model. For each iteration, the accuracy and precision are then calculated as a means of achieving the optimal model; the test data are applied so that the performance and efficiency of the model can be verified; and in the final step, the remaining verification data are employed to ensure that the results produced by the model have a high degree of accuracy and precision.
A decision tree partitions the input space of the dataset into mutually exclusive regions and assigns each region a label. The decision tree begins with a root node and ends with leaf nodes [23]. Multiple branches are formed between the root and the leaf nodes. The decision tree algorithm splits the data into multiple regions, and each region is divided in turn into smaller parts. Splitting continues until every branch ends in a leaf node. Each split is chosen based on an impurity measure. Two common impurity measures are the Gini index and entropy. In this paper, entropy is used as the impurity measure, which also evaluates the homogeneity of the partition nodes. The following steps summarize the decision tree algorithm.
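Step 1: The entropy of the root node with n branches is calculated as
$$E = -\sum_{i=1}^{n} p_i \log_2 p_i$$
where $p_i$ is the fraction of records that belong to class i at the node.
Step 2: The entropy of each partition with J subclasses is calculated as
$$E_{\text{partition}} = -\sum_{j=1}^{J} p_j \log_2 p_j$$
Step 3: The branch entropy is calculated from the k individual partition entropies as
$$E_{\text{branch}} = \sum_{i=1}^{k} \frac{n_i}{n} E_i$$
where $n_i$ is the number of records at partition i, n is the number of records at the branch, and $E_i$ is the entropy of partition i.
Step 4: $\text{GAIN}_{\text{Split}}$, which is used to decide the best partition, is the reduction in entropy achieved by a split. The partition that produces the largest reduction is chosen. $\text{GAIN}_{\text{Split}}$ is given by
$$\text{GAIN}_{\text{Split}} = E_{\text{root}} - E_{\text{branch}}$$
where E is the entropy.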
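The four steps above can be illustrated with a short sketch that computes the entropy of a node, the weighted entropy of a candidate split, and the resulting gain. Base-2 logarithms, the function names, and the class labels are assumptions made for this example; it is not the J48 implementation, which applies further refinements.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def branch_entropy(partitions):
    """Size-weighted sum of the entropies of the partitions of a split."""
    n = sum(len(part) for part in partitions)
    return sum(len(part) / n * entropy(part) for part in partitions if part)


def gain_split(parent, partitions):
    """Reduction in entropy obtained by splitting `parent` into `partitions`."""
    return entropy(parent) - branch_entropy(partitions)


# Hypothetical labels at the root node (not from the paper's data).
root = ["Case 1"] * 4 + ["Case 2"] * 4
clean = [["Case 1"] * 4, ["Case 2"] * 4]
mixed = [["Case 1", "Case 1", "Case 2", "Case 2"],
         ["Case 1", "Case 1", "Case 2", "Case 2"]]

print(entropy(root))
print(gain_split(root, clean))
print(gain_split(root, mixed))
```

The split that separates the two classes removes all of the root entropy, whereas the mixed split removes none, so the former would be selected.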
If all input attributes are used, the algorithm for decision tree induction is as shown in
Figure 4.
If the prediction order is requested for a specific month and the wind speeds are unavailable at that moment, the decision tree induction model shown in
Figure 5 is used. This model is based on a single input attribute: “month”.
A new model, Model 2, is implemented based on the output of the previous model, Model 1, as shown in
Figure 6. The implementation involves a comparison of the output for the five cases generated from the first model with that of the eight cases from the original training data. The output from these five cases along with the output from the original cases is then used as input to a similarity algorithm.
Next, the similarity algorithm measures the similarity score between the five cases and each case from the original data; e.g., Case 8 is similar to Case 4, Case 7 is similar to Case 3, and Case 6 is similar to Case 5. The algorithm relies on edit distance, which is a technique for quantifying how dissimilar two strings (e.g., words) are to one another based on a count of the minimum number of operations required to transform the first string into the second. The edit distance between two cases for the five locations is the minimum number of operations required for transforming one case into the other. For example, the edit distance between “case 1 case 2 case 3 case 4 case 5” and “case 1 case 3 case 2 case 4 case 5” is two. A flowchart of the similarity algorithm is shown in Figure 7.
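The worked example above can be checked with a short sketch of the edit distance. It uses the Levenshtein formulation (insert, delete, and substitute, each costing one operation) applied to sequences of tokens; the choice of Levenshtein and the function name are assumptions, since the text does not specify which operations the similarity algorithm counts.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


first = ["case 1", "case 2", "case 3", "case 4", "case 5"]
second = ["case 1", "case 3", "case 2", "case 4", "case 5"]
distance = edit_distance(first, second)
print(distance)  # 2
```

Swapping two adjacent entries costs two operations under this formulation, which agrees with the distance of two stated in the text.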
The resulting similarity pairs are employed for reprocessing the original training data through the replacement of the original cases with the similar cases, as shown in
Table 9 compared with
Table 6. The final step is that the resulting training data are applied for teaching Model 2, with the use of the decision tree as previously performed for developing Model 1. The degree of accuracy of Model 2 is then increased to 100%.
3. Experiments and Results
For this study, Waikato Environment for Knowledge Analysis (Weka) software was employed [38] for constructing decision trees according to the training set, using the standard J48 algorithm [39,40,41,42]. This algorithm has been selected as one of the top 10 algorithms in data mining [43]. Java was used as the development language with J2SDK version 1.6.0_22. Weka version 3.8.4 was employed for the experimental component of the model development.
Weka is first used to preprocess the data before the machine learning algorithms are applied. The wind speed data for the selected sites are loaded from CSV files by clicking the “Open file” button and selecting the data file. The loaded dataset is then evaluated with the Cross-validation option, which randomly partitions the data into k subsamples for training and testing; the number entered in the Folds field sets the number of subsamples. The J48 classifier is then used to create a pruned decision tree. The Classifier Model section of the output presents the model as a tree and gives information such as the number of leaves and the size of the tree. The stratified cross-validation section that follows reports the error rates and thus shows how successful the model is. The tree of the developed model can be displayed by right-clicking the result and selecting “Visualize tree”.
The performance measurements for this work were recall, precision, the classifier F1-score, and accuracy. Examining the data for accuracy and precision establishes the credibility of the results. Accuracy refers to how closely the measurements match the desired “true” value. Precision indicates how well repeated measurements agree with and are approximate to one another. As with the order of decisions about wind speed location, it is important that the values be close, i.e., a high level of precision, and at the same time, that the decisions be correct, i.e., a high degree of accuracy. Accuracy and precision are defined in (5) and (6):
$$\text{Accuracy} = \frac{T_P + T_N}{T_P + T_N + F_P + F_N} \tag{5}$$
$$\text{Precision} = \frac{T_P}{T_P + F_P} \tag{6}$$
where $T_P$ is true positive, $T_N$ is true negative, $F_P$ is false positive, and $F_N$ is false negative. True positives and true negatives are the outcomes for which the developed model correctly predicts the cases. By contrast, false positives and false negatives are the outcomes for which the model incorrectly predicts the cases.
Recall (R) is the ratio of the correctly predicted positive cases to all actual positive cases. Its formula is shown in (7):
$$R = \frac{T_P}{T_P + F_N} \tag{7}$$
where $T_P$ is true positive and $F_N$ is false negative.
The classifier F1-score is the harmonic mean of the precision and the recall. It is given as
$$F_1 = \frac{2PR}{P + R}$$
where P is the precision and R is the recall.
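The four measures defined above can be computed directly from the confusion-matrix counts, as in the following minimal sketch. The function name and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for one class (not taken from Table 11 or Table 12).
metrics = classification_metrics(tp=45, tn=40, fp=5, fn=10)
print(metrics)
```

For a multi-class problem such as this one, the counts would be taken per class, treating that class as positive and all others as negative.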
The performance measurement results are listed in
Table 10.
Measurements from another performance indicator established with the use of a confusion matrix are presented in
Table 11. The confusion matrix was built based on the data testing, and a confusion matrix was constructed for each class in the form shown in
Table 12.
Because the number of training cases is limited, care must be taken over how many training samples are held back for testing. Cross-validation was employed for testing, checking, and verifying the generalizability of the model. Any trained model tends to overfit, and cross-validation was applied as a means of detecting and limiting this effect. A common way to estimate how well a model generalizes is to reserve a small portion of the training data itself for validating the model, since this approach indicates the ability of the model to predict previously unseen data. K-fold cross-validation is a technique commonly used for this purpose. In the 10-fold version of k-fold cross-validation, the training set is randomly split into 10 groups of approximately the same size. The classifier is then trained using eight subsets. One of the two remaining subsets is used for validation and the last for testing. This process is repeated until every fold, one by one, has served as the test set. This technique establishes the generalizability of the model, especially when limited data make it difficult to divide the data into separate test and training sets.
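The rotation of folds can be sketched as follows. Note that this sketch implements the standard form of k-fold cross-validation, in which each fold is held out once and the other k − 1 folds are used for training; it does not reproduce the separate validation fold described above. The function name, the seed, and the sample count are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible via seed
    folds = [indices[i::k] for i in range(k)]  # k groups of near-equal size
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test


# Hypothetical dataset of 25 records split into 10 folds.
splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Averaging the accuracy over the k held-out folds gives the kind of figure reported for each fold count.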
Table 13 shows the average degree of accuracy for 2-fold, 4-fold, 6-fold, and 8-fold cross-validation and for the 10-fold cross-validation used in this paper.
In this research, a system was developed to rank locations according to wind speed. The process was carried out in three stages: data collection, data processing, and design. In the first stage, data are collected from sites in different regions, in this case the center, north, south, east, and west. These data contain wind speed along with additional information such as the longitude and latitude of the site and the date of each sample. The data are gathered into a central database that holds all the information drawn from the individual site databases. In the second stage (data processing), information that is not useful for this research, such as longitude and latitude, is discarded, and the date is replaced by the month. The central database is then rearranged, and the number of cases is reduced by using association rules, a well-known method of finding relationships, applied to all cases and their relationships to one another. The developed approach can be used for other locations and other databases, and, to the best of our knowledge, it has not previously been reported in the literature. Machine learning methods apply a set of algorithms to a set of data to build models that help in making decisions. The model is therefore not limited to these data and can serve as a foundation for addressing similar problems in other areas. Other factors, such as the wind direction and the daily maximum and minimum wind speeds, are important and might serve different applications. In this paper, however, the focus was on wind speed, with the specific goal of providing network operators with an understanding of the possible productivity of each wind site location, thus facilitating the optimal management and installation of wind plants and network operations. These other factors are left for future work. Wind direction in particular will play an important role in determining the siting of wind plants and the layout of wind turbines.
The proposed model shows promise in that the wind speeds at two locations are sufficient for obtaining the order of preference of all the locations. For example, if it is known only that the wind speed in the east region is below 3.05 m/s, then this scenario follows Case 2. Once the case is known, the order of the wind speed values at all locations can be determined. If the wind speed in the east region is greater than 3.5 m/s but less than 3.72 m/s, the status of the wind speed at the other locations can be extracted from the Case 4 scenario. If the wind speed in the east region is greater than 3.77 m/s, the wind speed at the center location determines whether the scenario follows Case 6 or Case 7. This feature of the proposed model saves the time and effort that would otherwise be required for predicting the wind speed at multiple locations. The model can thus be very helpful to system operators who desire an easy, quick, and accurate method of determining the status of the wind speeds at different locations.