1. Introduction
Although traditional energy resources, such as oil and coal, account for the largest proportion of energy worldwide, they also produce more pollution than solar energy. As environmental awareness and the need to reduce pollution have increased, solar energy has become an important energy source in industrialized countries. Photovoltaic (PV) systems convert solar energy into electrical energy. However, PV systems are not yet popular and their transfer efficiency must be improved. Hence, engineers have used various combinations of system components to increase the transfer efficiency of PV systems. Generally, the transfer efficiency of a PV system is only 6–20% [1]. According to the opinions of experts in PV energy in Taiwan, a transfer efficiency exceeding 9% is considered high and one of 9% or less is considered low [2]. Engineers and energy managers must judge which category a PV system belongs to; thus, a reliable prediction model is needed to determine whether the transfer efficiency of a PV system is high or low. Managers or decision-makers in the PV field will then be able to identify the critical components using the prediction model and improve transfer efficiencies. Thus, this work develops a novel and efficient prediction model to determine whether the transfer efficiency of a PV system is high or low.
In applications of discriminating models, most studies utilized different approaches to construct an effective prediction model [3,4,5,6,7]. These models were constructed using conventional statistical methods, such as discriminant analysis and logistic regression, or artificial intelligence (AI) methods, such as artificial neural networks (ANNs) and support vector machines (SVMs). Ong et al. [8] demonstrated that a discriminating model constructed using an ANN-based method is more accurate than a model constructed using traditional statistical methods, especially when data-sets are non-linear. However, ANN-based discriminating models have poor prediction accuracy when applied to small samples or when input variables are irrelevant [9]. Additionally, hidden layers in an ANN are difficult to explain, and the relationship between input variables and output variables in an ANN or SVM cannot be expressed by a mathematical equation. Genetic programming (GP) has recently been applied in many fields to construct classification or prediction models. Since GP does not require any assumptions about the relationships between dependent and independent variables to construct a prediction model [10], GP can be applied to both small and large samples [8]. In some applications, GP has better prediction accuracy than ANN-based methods. For example, Ong et al. [8] utilized GP to construct a more satisfactory credit scoring model than an ANN model; Muttil and Lee [11] utilized GP to predict coastal algal blooms and claimed that GP yielded a more effective prediction model than an ANN in their case. In prediction or classification applications, GP can be used to construct a mathematical equation [10,11,12]. Moreover, a comparison of the performance of classification models indicated that GP outperforms conventional statistical methods and ANNs [13].
Measuring and monitoring energy efficiency have become important issues in many fields [14]. Some studies have utilized data envelopment analysis (DEA) to assess energy efficiency. For instance, Boyd and Pang [15] examined the relationship between productivity and energy intensity utilizing DEA to assess productivity. Hu and Kao [16] developed an energy efficiency index utilizing DEA. This index is used to determine the energy-saving target ratio (ESTR) for seventeen APEC countries. Based on the importance of energy efficiency and the ability of DEA to determine the ratio between input and output variables, this work adopts DEA to evaluate the input/output efficiency of PV systems using multiple inputs, such as texture type, selection of a PV module, and PV module capacity, and one output (transfer efficiency of PV systems).
Moreover, identifying significant input variables is important when constructing an effective prediction model. Many conventional methods, such as correlation analysis, have been utilized to identify the significant input variables for predicting the output variable. However, such methods are restricted by assumptions such as a linear relationship among variables and normality, and they require large data-sets. Thus, a technique that extracts the knowledge contained in a data-set and provides clear attribute selection under different classes is desirable. Rough set theory (RST) can be utilized as a soft computing tool to deal with data-sets with poor information and remove irrelevant attributes from a data-set [17]. Notably, RST has been applied in many real-world classification problems [18,19,20].
To construct an efficient prediction model that determines whether the transfer efficiency of a PV system is high or low, this work uses the input/output efficiency of a PV system as a predictive variable and enhances prediction accuracy using a novel hybrid model combining RST with GP; this model is called the RST-GP model. Because of its robust reliability in knowledge systems, RST is utilized during the first stage to identify significant input variables. During the second stage, the significant independent variables obtained from RST are utilized as input variables for GP to construct a prediction model that can determine whether the transfer efficiency of a PV system is high or low. The remainder of this paper is organized as follows.
Section 2 reviews the PV system literature.
Section 3 briefly reviews the DEA model used to evaluate the input/output efficiency of a PV system.
Section 4 describes RST and GP.
Section 5 elucidates the proposed hybrid model.
Section 6 analyzes and compares the outcomes of the proposed and existing hybrid models.
Section 7 gives conclusions.
3. Using DEA to Determine Efficiencies
Notably, DEA is a linear programming (LP)-based technique for evaluating decision-making units (DMUs) and deals with many decision-making problems by converting multiple output and input variables into a single comprehensive performance measure [23]. DEA is an extensively utilized non-parametric data analysis technique. For instance, Hu and Kao [16] utilized DEA to construct an energy efficiency index. This index is used to determine the energy-saving target ratio (ESTR) for seventeen Asia-Pacific Economic Cooperation (APEC) countries. Tsai et al. [23] applied DEA with other measures to assess the magnitude of performance differences between leading telecom carriers. Guo and Tanaka [24] utilized a fuzzy DEA model to solve an efficiency evaluation problem with given fuzzy input and output data. Wu et al. [25] used the DEA-neural network approach to evaluate branch efficiency for a large Canadian bank. Additional detailed descriptions of DEA can be found elsewhere [26,27,28].
DEA, developed by Charnes, Cooper, and Rhodes (CCR) [28], was based on Farrell's (1957) pioneering study of efficiency measures (relative efficiency or productivity of a specific DMU) [29]. Suppose data for each DMU, $j = 1, \ldots, n$, comprise q positive outputs, $y_{rj}$, $r = 1, \ldots, q$, and p positive inputs, $x_{ij}$, $i = 1, \ldots, p$. Let $h_o$ ($o \in \{1, \ldots, n\}$) be the relative efficiency of the DMU under evaluation, which is to be maximized. The DEA model is displayed as LP as follows:
$$
\begin{aligned}
\max \quad & h_o = \sum_{r=1}^{q} u_r y_{ro} \\
\text{s.t.} \quad & \sum_{i=1}^{p} v_i x_{io} = 1, \\
& \sum_{r=1}^{q} u_r y_{rj} - \sum_{i=1}^{p} v_i x_{ij} \le 0, \quad j = 1, \ldots, n, \\
& u_r \ge 0, \; r = 1, \ldots, q; \quad v_i \ge 0, \; i = 1, \ldots, p,
\end{aligned}
$$
where $u_r$ and $v_i$ are the variable weights given to the rth output and ith input of the oth DMU, respectively. Furthermore, $u_r$ and $v_i$ are decision variables of LP modeling used to determine the relative efficiency of DMU o. Obviously, the maximum value (efficiency score), $h_o^*$, cannot exceed 1. If $h_o^* = 1$, DMU o lies on the constant returns to scale (CRS) frontier [30]. There are two CCR models in practice. One minimizes input variables, and the other maximizes output variables. In this work, in order to obtain maximum energy efficiency, the output-maximizing CCR model is utilized to obtain the optimal value of the objective function, $h_o^*$.
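As a minimal sketch of the efficiency score, assume the degenerate case of one input and one output per DMU (the study itself uses several inputs). Under constant returns to scale, the CCR score then reduces to each DMU's output/input ratio divided by the best ratio in the sample. The function name `ccr_single_ratio` and the data are hypothetical and serve only as illustration.

```python
def ccr_single_ratio(inputs, outputs):
    """CCR efficiency for DMUs with exactly one input and one output.

    Assumption: with a single input and a single output, the CCR score of a
    DMU equals its output/input ratio divided by the largest ratio observed.
    """
    if len(inputs) != len(outputs) or not inputs:
        raise ValueError("inputs and outputs must be non-empty and of equal length")
    if any(x <= 0 for x in inputs) or any(y <= 0 for y in outputs):
        raise ValueError("DEA assumes strictly positive inputs and outputs")
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    # Scores lie in (0, 1]; a score of 1 marks a DMU on the CRS frontier.
    return [ratio / best for ratio in ratios]


# Hypothetical data: one input and one output for three PV systems.
capacity = [2.0, 4.0, 5.0]
transfer_efficiency = [10.0, 12.0, 20.0]
scores = ccr_single_ratio(capacity, transfer_efficiency)
```

With several inputs, as in this work, the weights must instead be found by solving one LP per DMU.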
5. The Proposed Hybrid Prediction Model
This work develops a four-step procedure for predicting whether the transfer efficiency of a PV system is high or low. The proposed prediction model is as follows:
Step 1: Collect transfer efficiencies of PV systems with various component combinations. These components are independent variables and transfer efficiency is a binary output variable (i.e., high or low) in the proposed prediction model.
Step 2: RST selects the significant independent variables of a PV system based on its robust reliability in knowledge systems [36,37,38,39]. The importance of feature selection based on RST (i.e., core analysis) can be explained as follows [44]:
$$
\sigma_{(C,D)}(a) = \gamma_C(D) - \gamma_{C-\{a\}}(D)
$$
where $\gamma_C(D)$ denotes the degree of dependence between the conditional features C (the variables of PV systems) and the decision feature D (i.e., the high or low PV transfer efficiency), $\gamma_{C-\{a\}}(D)$ denotes the degree of dependence between C with a conditional feature a removed and the decision feature D, and $\sigma_{(C,D)}(a)$ denotes the change in the degree of dependence when a is removed from the full set of conditional features C. When $\sigma_{(C,D)}(a)$ is large, feature a strongly affects the decision attribute D.
Step 3: The DEA evaluates energy efficiency (i.e., the input/output ratio) of a PV system. The input variables in DEA are obtained in Step 2 and the output variable in DEA is transfer efficiency of a PV system. The DMU values obtained from DEA represent energy efficiencies of PV systems.
Step 4: GP constructs a classification model for predicting whether the transfer efficiency of a PV system is high or low. For the GP model, this work utilizes the significant independent variables obtained in Step 2 and the input/output ratio obtained in Step 3 as input variables, and the binary transfer efficiency (i.e., high or low) of a PV system as the output variable.
Table 1 presents the parameter settings of the GP model. The parameters of GP are obtained by a trial-and-error approach.
Table 1.
The settings of the GP model.
Items | Content |
---|---|
Population size | 400 |
Maximum number of generations | 1000 |
Function set | +, −, ×, ÷, sin, cos, exp, log, constant |
Crossover rate | 0.8 |
Mutation rate | 0.02 |
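To illustrate how a GP individual built from the function set in Table 1 can act as a classifier, the sketch below evaluates an expression tree on one sample and thresholds the result into class 1 (high) or 0 (low). The protected division, logarithm, and exponential, the 0.5 threshold, and the example tree are assumptions made for illustration; the source does not state how these details were handled.

```python
import math


def protected_div(a, b):
    # Assumption: return 1.0 for a (near-)zero denominator, a common GP convention.
    return a / b if abs(b) > 1e-12 else 1.0


def protected_log(a):
    # Assumption: logarithm of the absolute value, and 0.0 at zero.
    return math.log(abs(a)) if abs(a) > 1e-12 else 0.0


def protected_exp(a):
    # Clamp the argument so that exp cannot overflow.
    return math.exp(min(a, 50.0))


BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": protected_div,
}
UNARY = {"sin": math.sin, "cos": math.cos, "exp": protected_exp, "log": protected_log}


def evaluate(tree, sample):
    """Evaluate a tree given as a number, a variable name, or (op, child, ...)."""
    if isinstance(tree, (int, float)):
        return float(tree)          # constant terminal
    if isinstance(tree, str):
        return float(sample[tree])  # variable terminal
    op, *children = tree
    if op in BINARY:
        return BINARY[op](evaluate(children[0], sample), evaluate(children[1], sample))
    return UNARY[op](evaluate(children[0], sample))


def classify(tree, sample, threshold=0.5):
    # Hypothetical decision rule: 1 (high) at or above the threshold, else 0 (low).
    return 1 if evaluate(tree, sample) >= threshold else 0


# Hypothetical individual: X10 * cos(X1) + 0.25
individual = ("+", ("*", "X10", ("cos", "X1")), 0.25)
sample = {"X1": 0.0, "X10": 0.9}
predicted_class = classify(individual, sample)
```

A full GP run would evolve a population of such trees through crossover and mutation, using the classification rate as the fitness measure.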
In Step 2, RST is utilized to select the significant independent variables of PV systems because adopting significant independent variables can yield good accuracy when constructing a prediction model [36]. Moreover, RST not only deals with small data-sets but also requires no statistical assumptions (such as a linear relationship between the input variables and the output variable). In Step 3, DEA is utilized to evaluate the energy efficiency of PV systems because this index provides sufficient information for evaluating the economic value of PV systems. In Step 4, GP is utilized to construct a prediction model because of its high performance in forecasting and classification. Furthermore, GP yields good forecasts using only small data-sets [42]. Hence, RST, DEA, and GP are integrated herein to predict the high or low transfer efficiency of PV systems, and the model thus developed is called the RST-DEA-GP model.
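The RST feature-selection idea in Step 2 can be sketched as follows, assuming that the importance of an attribute is measured as the drop in the degree of dependence when that attribute is removed from the conditional attributes. The decision table, attribute names, and function names below are hypothetical.

```python
from collections import defaultdict


def dependency(rows, condition_attrs, decision_attr):
    """Degree of dependence: share of objects whose condition values
    determine the decision value unambiguously (the positive region)."""
    blocks = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in condition_attrs)
        blocks[key].append(row[decision_attr])
    consistent = sum(len(decisions) for decisions in blocks.values()
                     if len(set(decisions)) == 1)
    return consistent / len(rows)


def significance(rows, condition_attrs, decision_attr, attr):
    """Drop in the degree of dependence when `attr` is removed."""
    reduced = [a for a in condition_attrs if a != attr]
    return (dependency(rows, condition_attrs, decision_attr)
            - dependency(rows, reduced, decision_attr))


# Hypothetical decision table: two condition attributes and a binary decision.
table = [
    {"texture": "A", "inverters": 1, "high": 1},
    {"texture": "A", "inverters": 2, "high": 1},
    {"texture": "B", "inverters": 1, "high": 0},
    {"texture": "B", "inverters": 2, "high": 0},
]
conditions = ["texture", "inverters"]
sig_texture = significance(table, conditions, "high", "texture")
sig_inverters = significance(table, conditions, "high", "inverters")
```

In this toy table the decision is fully determined by `texture`, so removing it destroys the dependence, whereas removing `inverters` changes nothing.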
6. Empirical Analysis
A real data-set of transfer efficiency of PV systems collected from a Taiwanese research organization is utilized to demonstrate the effectiveness of the proposed model. The data used in Step 1 concern 38 PV systems. Each PV system is described by 18 variables (e.g., texture type, capacity for PV-transfer, and number of inverters) and a binary transfer efficiency (i.e., low or high). The low and high transfer efficiencies of the PV systems are coded as 0 and 1, respectively. The data-set comprises 38 PV systems: 15 with low and 23 with high transfer efficiencies.
Table 2.
Selected significant variables from RST and DMU variable from DEA.
Variables | Description | Importance (obtained from RST) |
---|---|---|
X1 | Texture type | 0.6424 |
X2 | The output power of inverter | 0.5715 |
X3 | The selection of PV module | 0.4817 |
X4 | The number of inverter | 0.3914 |
X5 | The weights of PV module | 0.3367 |
X6 | The selection of inverter | 0.2893 |
X7 | PV module capacity | 0.2567 |
X8 | The selection of DC voltage | 0.2638 |
X9 | The location of PV setting | 0.2476 |
X10 | DMU (obtained from DEA) | - |
In Step 2 of the proposed hybrid model, RST is utilized to identify significant independent variables of PV systems. The RST algorithm can be implemented in MATLAB. The RST results indicate that nine independent variables (X1–X9) are significant (Table 2) because their importance values are greater than 0.2. No clear criterion exists for setting this threshold on the importance value. Moreover, the nine independent variables (X1–X9) are highly correlated with the output variable (the low or high transfer efficiency of PV systems); the correlation coefficients are greater than 0.6. Also, according to experts in PV energy in Taiwan, these nine variables strongly influence the transfer efficiency of PV systems.
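The threshold rule can be written out directly. The sketch below uses the importance values listed in Table 2 and the 0.2 cut-off stated above; the helper name `select_features` and the alternative cut-off of 0.3 are hypothetical and only show how sensitive the selection is to the threshold.

```python
# Importance values of X1-X9 as listed in Table 2.
importance = {
    "X1": 0.6424, "X2": 0.5715, "X3": 0.4817,
    "X4": 0.3914, "X5": 0.3367, "X6": 0.2893,
    "X7": 0.2567, "X8": 0.2638, "X9": 0.2476,
}


def select_features(importance_by_name, threshold=0.2):
    """Keep the variables whose importance strictly exceeds the threshold."""
    return [name for name, value in importance_by_name.items() if value > threshold]


selected = select_features(importance)            # cut-off used in the text
stricter = select_features(importance, 0.3)       # hypothetical alternative
```

With the 0.2 cut-off all nine variables are retained, while a cut-off of 0.3 would keep only X1 to X5.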
In Step 3, DEA is utilized to evaluate the DMU value of each PV system. Table 2 lists the DMU variable (X10). In applying DEA, the input variables are the nine significant variables obtained in Step 2 and the output variable is PV system transfer efficiency. The DEA model can be solved with LINGO software. Table 3 lists the DMU values of the PV systems. In Step 4, the significant independent variables obtained in Step 2 and the DMU obtained in Step 3 are utilized as input variables for GP to predict the high or low level of PV system transfer efficiency. To demonstrate the effectiveness of the proposed hybrid model, several basic classification models, namely K Nearest Neighbor (KNN), Naive Bayes (NB), SVM, ANN, and GP, are utilized as benchmark models. These basic classification models are data-mining techniques and can obtain better prediction performance than traditional linear statistical methods (e.g., linear regression) [8,10].
Table 3.
The results of DMU value of each PV system by utilizing DEA.
No | DMU | No | DMU |
---|---|---|---|
PV001 | 1.0000 | PV023 | 0.7735 |
PV002 | 0.9482 | PV024 | 0.8059 |
PV003 | 0.9879 | PV025 | 1.0000 |
PV004 | 0.8392 | PV026 | 1.0000 |
PV005 | 1.0000 | PV027 | 1.0000 |
PV006 | 1.0000 | PV028 | 1.0000 |
PV007 | 1.0000 | PV029 | 1.0000 |
PV008 | 1.0000 | PV030 | 0.6981 |
PV009 | 1.0000 | PV031 | 0.6417 |
PV010 | 0.6902 | PV032 | 0.6608 |
PV011 | 0.9215 | PV033 | 0.4919 |
PV012 | 0.5153 | PV034 | 1.0000 |
PV013 | 0.4955 | PV035 | 0.8274 |
PV014 | 0.9667 | PV036 | 0.4947 |
PV015 | 0.7484 | PV037 | 0.8405 |
PV016 | 1.0000 | PV038 | 0.9944 |
PV017 | 0.6144 | | |
PV018 | 0.8630 | | |
PV019 | 1.0000 | | |
PV020 | 0.8630 | | |
PV021 | 1.0000 | | |
PV022 | 0.8832 | | |
Although a previous study [36] also adopted a hybrid classification model combining RST, DEA, and SVM to predict business failures, the RST stage of that methodology did not specify, through a clear equation, how the important variables were obtained. That study [36] only adopted the RSES software tool [45] to select important variables. Furthermore, the SVM model performs well only with large data-sets, and collecting large data-sets for PV systems is difficult. Hence, the use of a suitable classification model for small data-sets is important for constructing a high-precision prediction model.
To compare the accuracy of the hybrid prediction model with and without DEA, this work designs several experiments. The proposed model, named the RST-DEA-GP model (model I), adopts the significant variables obtained by RST and the DMU variable obtained by DEA as input variables for GP. The RST-GP model (model II) adopts only the significant variables, X1–X9, as the input variables for GP. For both models I and II, this work adopts leave-one-out cross validation to test the accuracy of the prediction model. Table 4 and Table 5 show the analytical results for hybrid models I and II, respectively. Model I has an average correct classification rate of 92.10%, and that of model II is 84.21%. Hence, adding DEA provides more information than adopting the significant input variables only and enhances prediction model accuracy.
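The leave-one-out procedure can be sketched as below. Each sample is held out once, a classifier is fitted on the remaining samples, and the held-out sample is predicted. A nearest-neighbour rule stands in for the GP model purely to keep the example short, and the data and function names are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier: label of the closest training sample
    (squared Euclidean distance). The study uses GP at this point."""
    closest = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[closest]


def leave_one_out_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Hold out each sample once and return the share predicted correctly."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical data: five systems, two inputs each, coded 0 (low) or 1 (high).
features = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.45]]
labels = [0, 0, 1, 1, 1]
accuracy = leave_one_out_accuracy(features, labels)
```

With 38 PV systems the same loop would train 38 models, each on 37 systems, which suits small data-sets because no sample is permanently set aside for testing.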
Table 4.
RST-DEA-GP model (model I) results with both significant variables and DMU.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 22 (95.65%) | 1 (4.35%) |
2 (Low-level) | 2 (13.33%) | 13 (86.67%) |
Table 5.
RST-GP model (model II) results with only significant variables.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 21 (91.30%) | 2 (8.70%) |
2 (Low-level) | 4 (26.67%) | 11 (73.33%) |
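The correct classification rates quoted for models I and II follow from the counts in Tables 4 and 5. The short sketch below recomputes them; the function name `classification_rates` is hypothetical.

```python
def classification_rates(matrix):
    """Return (overall rate, per-class rates) for a square confusion matrix,
    where matrix[i][j] counts systems of actual class i classified as class j."""
    total = sum(sum(row) for row in matrix)
    overall = sum(matrix[i][i] for i in range(len(matrix))) / total
    per_class = [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]
    return overall, per_class


# Counts from Table 4 (model I) and Table 5 (model II); rows are high, low.
model_1 = [[22, 1], [2, 13]]
model_2 = [[21, 2], [4, 11]]

overall_1, per_class_1 = classification_rates(model_1)
overall_2, per_class_2 = classification_rates(model_2)
```

Model I classifies 35 of the 38 systems correctly and model II classifies 32 of 38, which corresponds to the rates of about 92.1% and 84.2% reported in the text.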
The RST-SVM-based models are also utilized to predict whether PV systems have high or low transfer efficiency. The RST-DEA-SVM model uses both the significant variables obtained from RST and the DMU as input variables of SVM (model III). The RST-SVM model, which utilizes only significant attributes, is model IV. In constructing the SVM model, this work utilizes STATISTICA software to generate a classification model. Some studies [46,47] utilized the Gaussian kernel function to enhance prediction performance. For the SVM model, the parameter settings are the Gaussian kernel function,
C = 3, and
, which can generate an appropriate prediction model.
Table 6 and Table 7 summarize the confusion matrices obtained with models III and IV, respectively. Based on the RST-SVM-based model results, adding DEA improves the correct classification rate from 78.94% to 81.57%.
Table 6.
RST-DEA-SVM model (model III) results with significant variables and DMU.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 20 (86.96%) | 3 (13.04%) |
2 (Low-level) | 4 (26.67%) | 11 (73.33%) |
Table 7.
RST-SVM model (model IV) results with only significant variables.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 20 (86.96%) | 3 (13.04%) |
2 (Low-level) | 5 (33.33%) | 10 (66.67%) |
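For reference, the Gaussian kernel used by the SVM models measures the similarity of two samples as a decaying function of their squared distance. The sketch below assumes the common parameterization K(x, z) = exp(−γ‖x − z‖²); the value of `gamma` used in the example is a hypothetical placeholder, not the setting used in this study.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Hypothetical samples and kernel width.
same = gaussian_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)   # identical samples
apart = gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
```

Identical samples give a kernel value of 1, and the value decays toward 0 as samples move apart, at a rate controlled by `gamma`.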
Furthermore, two RST-ANN-based prediction models are applied. One uses the significant variables obtained from RST and the DMU variable obtained from DEA as input variables for an ANN (model V, named the RST-DEA-ANN model). The RST-ANN model utilizes only the significant variables as input variables for the ANN (model VI). This work uses Qnet2000 software to construct the ANN classification model. Cybenko [48] demonstrated that utilizing one hidden layer is sufficient when modeling any complex system. Hence, the appropriate network structures (input layer, hidden layer, and output layer nodes) are 10-5-1 and 9-7-1 for models V and VI, respectively. Table 8 and Table 9 summarize the confusion matrices obtained with models V and VI, respectively. As with the RST-SVM-based models, the RST-ANN-based results show that adding DEA improves the correct classification rate, here from 76.31% to 81.57%.
Table 8.
RST-DEA-ANN model (model V) results with both significant variables and DMU.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 21 (91.30%) | 2 (8.70%) |
2 (Low-level) | 5 (33.33%) | 10 (66.67%) |
Table 9.
RST-ANN model (model VI) results with only significant variables.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 19 (82.61%) | 4 (17.39%) |
2 (Low-level) | 5 (33.33%) | 10 (66.67%) |
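The 10-5-1 structure of model V can be pictured with a small forward pass: ten inputs, five hidden nodes, and one output node. The sketch below assumes sigmoid activations and random, untrained weights, neither of which is specified in the source; it shows the shape of the network only, not the trained Qnet2000 model.

```python
import math
import random


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def init_layer(n_inputs, n_nodes, rng):
    # One row per node: n_inputs weights followed by a bias term.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_nodes)]


def forward(layer, inputs):
    # Weighted sum plus bias for every node, passed through the sigmoid.
    return [sigmoid(sum(w * x for w, x in zip(node[:-1], inputs)) + node[-1])
            for node in layer]


def predict(network, inputs):
    activations = inputs
    for layer in network:
        activations = forward(layer, activations)
    return activations[0]  # single output node


rng = random.Random(0)  # fixed seed so the sketch is repeatable
network = [init_layer(10, 5, rng), init_layer(5, 1, rng)]
n_parameters = sum(len(node) for layer in network for node in layer)
output = predict(network, [0.5] * 10)  # hypothetical scaled inputs X1-X10
```

The output lies between 0 and 1 and could be thresholded to a high or low class after training. A 10-5-1 network of this form has 61 adjustable weights and biases.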
Following the same analysis as for models I to VI, RST-KNN-based and RST-NB-based prediction models are applied to predict whether PV systems have high or low transfer efficiency. This work also adopts STATISTICA to construct the KNN and NB classification models. For KNN classification, one model uses the significant variables obtained from RST and the DMU variable obtained from DEA as input variables (model VII, named the RST-DEA-KNN model). The RST-KNN model utilizes only the significant variables as input variables (model VIII). Table 10 and Table 11 summarize the confusion matrices obtained with models VII and VIII, respectively. Based on the RST-KNN-based model results, adding DEA improves the correct classification rate from 73.68% to 76.31%.
Table 10.
RST-DEA-KNN model (model VII) results with both significant variables and DMU.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 19 (82.61%) | 4 (17.39%) |
2 (Low-level) | 5 (33.33%) | 10 (66.67%) |
Table 11.
RST-KNN model (model VIII) results with only significant variables.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 18 (78.26%) | 5 (21.74%) |
2 (Low-level) | 5 (33.33%) | 10 (66.67%) |
For NB classification, one model uses the significant variables obtained from RST and the DMU variable obtained from DEA as input variables (model IX, named the RST-DEA-NB model). The RST-NB model utilizes only the significant variables as input variables (model X). Table 12 and Table 13 summarize the confusion matrices obtained with models IX and X, respectively. Based on the RST-NB-based model results, adding DEA improves the correct classification rate from 73.68% to 76.31%.
Table 12.
RST-DEA-NB model (model IX) results with both significant variables and DMU.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 19 (82.61%) | 4 (17.39%) |
2 (Low-level) | 5 (33.33%) | 10 (66.67%) |
Table 13.
RST-NB model (model X) results with only significant variables.
Actual class | Classified as 1 (High-level) | Classified as 2 (Low-level) |
---|---|---|
1 (High-level) | 19 (82.61%) | 4 (17.39%) |
2 (Low-level) | 6 (40.00%) | 9 (60.00%) |
To compare the performance of hybrid models I to X with that of the basic classification models (GP, SVM, ANN, KNN, and NB), the basic models are also used on their own to construct prediction models. For these, the original 18 variables of the PV system are the input variables of each basic classification model and the output variable is the binary transfer efficiency (high or low) of a PV system. The results (average correct classification rate) of the basic classification models, based on leave-one-out cross validation, are: GP (84.21%, model XI), SVM (78.94%, model XII), ANN (78.94%, model XIII), KNN (71.05%, model XIV), and NB (68.42%, model XV). To determine the computational demands of all classification models (models I to XV), the computing time of each is calculated (Table 14).
Table 14.
Computational time of different models (seconds).
Model | Computational time |
---|---|
Model I (RST-DEA-GP) | 65.13 |
Model II (RST-GP) | 62.34 |
Model III (RST-DEA-SVM) | 60.17 |
Model IV (RST-SVM) | 56.49 |
Model V (RST-DEA-ANN) | 63.28 |
Model VI (RST-ANN) | 60.67 |
Model VII (RST-DEA-KNN) | 55.23 |
Model VIII (RST-KNN) | 53.28 |
Model IX (RST-DEA-NB) | 54.87 |
Model X (RST-NB) | 52.81 |
Model XI (GP) | 51.78 |
Model XII (SVM) | 50.46 |
Model XIII (ANN) | 51.39 |
Model XIV (KNN) | 48.23 |
Model XV (NB) | 46.26 |
These analytical results demonstrate that the proposed hybrid model is more accurate than the other hybrid models in predicting whether the transfer efficiency of PV systems is high or low. Although the computational time of the proposed model exceeds that of the other models, it achieves the highest correct classification rate (92.10%). Notably, adding the DMU variable, obtained from DEA, as an independent variable to the GP, SVM, ANN, KNN, and NB models yields more information than considering only the significant variables, and enhances the classification accuracy of each model. Finally, the proposed model outperforms the use of a single classification model (GP) alone.
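The effect of adding the DMU variable can be summarized from the correct classification rates reported above for models I to X. The sketch below simply tabulates the gain, in percentage points, for each classifier; the variable names are hypothetical.

```python
# Correct classification rates (%) reported in the text:
# (with the DMU variable from DEA, with significant variables only).
reported_rates = {
    "GP": (92.10, 84.21),
    "SVM": (81.57, 78.94),
    "ANN": (81.57, 76.31),
    "KNN": (76.31, 73.68),
    "NB": (76.31, 73.68),
}

# Gain in percentage points from adding the DMU variable.
dea_gain = {
    name: round(with_dea - without_dea, 2)
    for name, (with_dea, without_dea) in reported_rates.items()
}

largest_gain = max(dea_gain, key=dea_gain.get)
```

On these reported figures the gain ranges from 2.63 points (SVM, KNN, NB) to 7.89 points (GP), so the DMU variable helps the GP-based model the most.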