1. Introduction
A fuzzy set is a set in which each element belongs to the set with a degree of membership. Building on this concept, various methodologies, algorithms, and structures have been actively researched [
1,
2,
3,
4,
5]. Although many studies on fuzzy sets have been conducted, a common feature of these studies is that the model output is a numerical value [
6,
7,
8,
9,
10,
11,
12,
13]. On the other hand, Pedrycz et al. [
14] proposed a granular model (GM), in which the model output is a triangular fuzzy number rather than a numerical value. GM generates information granules by using a context-based fuzzy clustering algorithm that uses the properties of the data in the input and output variables. The incremental granular model (IGM) is a structure that combines linear regression (LR) and GM, where the global part computes the prediction error through the LR model, while the local part compensates for it through GM [
15,
16]. This method improves the system performance by using evolutionary approaches [
17,
18]. In addition, Leite et al. [
19] proposed a segmentation neural network that evolves from fuzzy data streams. Loia et al. [
20] proposed a functional network that takes into account the granularity and time delay of information. Colace et al. [
21] proposed a new computational model with an iterative structure through data segmentation.
Accuracy and clarity are important factors in evaluating the performance of the abovementioned prediction models. In general, model accuracy is measured by the conventional root mean square error (RMSE) or the mean absolute percentage error (MAPE). RMSE evaluates the model's performance by subtracting the predicted value from the actual output value, calculating the mean of the squared differences, and then taking the square root. The following papers evaluated model performance using RMSE. Huang et al. [
22] proposed a minimum mean square estimator for mobile location using time-difference-of-arrival measurements. Yu et al. [
23] proposed a model based on the wide channel sounding of indoor staircases, corridors and office environments. Hwang et al. [
24] proposed an open-loop low-complexity multiple-input-multiple-output spatial multiplexing method, in which multiple data streams are spatially multiplexed and can be repeatedly detected [
25].
The following papers use MAPE and related error measures for evaluation. Draper et al. [
26] estimated mean-squared errors over continental-scale domains. Wu [
27] presented a linear programming method for estimating the parameters under the criterion of minimizing MAPE (also called average relative error). Myttenaere et al. [
28] studied the consequences of using MAPE as a quality measure for regression models. McKenzie [
29] used absolute percentage error and bias in economic forecasting applications. Kim [
30] applied the mean absolute percentage error to intermittent demand forecasts. In addition, various MAPE methods have been studied so far [
31,
32,
33,
34,
35,
36].
Although methods for evaluating model accuracy have been actively researched, studies related to model clarity are still necessary. Pedrycz [
37,
38,
39] proposed a method to evaluate GM accuracy and clarity through a performance index (PI), to emphasize the information granularity of a fuzzy set. In addition, he proposed a fuzzy construction method using the parametric approach of fine granules [
36] and free-structure segmentation mapping using the principle of justification [
37]. Zhang et al. [
40] proposed a granular aggregation method in distributed data environments. Zhongjie et al. [
41] explained the stabilization of information granules. Liu et al. [
42] focused on designing a model with a higher type based on information granules.
In this study, the prediction performance of interval-based IGM is compared and analyzed using the PI with information granules of linguistic intervals. The linguistic intervals are generated in the output space under three cases according to the segmentation method, and the PI obtained in each case is compared with the traditional performance evaluation method. To validate the PI method, experiments are performed on a concrete compressive strength example from civil engineering. This paper is organized as follows.
Section 2 describes the interval-based fuzzy clustering algorithm and GM.
Section 3 describes the architecture and procedure of IGM.
Section 4 describes the performance evaluation method, and
Section 5 conducts an experiment on a concrete compressive strength example. Finally,
Section 6 concludes the study and provides a future plan.
2. Interval-Based GM
In this section, we describe the interval-based GM for designing an interval-based IGM. Interval-based GM generates information granules using the interval-based fuzzy c-means (IFCM) clustering algorithm. The following sections describe the IFCM clustering method and interval-based GM.
2.1. IFCM Clustering
In this study, interval-based GM is designed by IFCM clustering to generate information granulation. The IFCM clustering method creates an interval in the output space, taking into account the pattern characteristics of the input and output spaces. The general fuzzy clustering method [
43] estimates a cluster by using the distance between its center and the input data in the input space without considering the output space. On the other hand, the IFCM performs clustering by considering the pattern characteristics of the output space, and thus, performs better than the existing clustering method.
IFCM determines the cluster centers and the membership matrix using the following steps:
- [Step 1]
Set the number of intervals I (1 < I < q) and the number of clusters per interval C (2 < C < n).
- [Step 2]
Initialize the membership matrix with random values between 0 and 1.
- [Step 3]
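Calculate the center $v_i$ (i = 1, 2, ..., C) of each cluster by the following equation.
$$v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}$$
- [Step 4]
Compute the partition matrix U as follows
$$u_{ik} = \frac{f_k}{\sum_{j=1}^{C} \left( \frac{\lVert x_k - v_i \rVert}{\lVert x_k - v_j \rVert} \right)^{2/(m-1)}}$$
where $f_k$ represents the degree of inclusion of $x_k$ in the generated cluster. Then, $f_k$ is determined by the interval A of the output space, $f_k = A(y_k)$. $u_{ik}$ denotes the membership degree induced by the ith cluster and the kth data, $x_k$ represents the kth input data, and m is the weighting exponent.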
- [Step 5]
When the equation is satisfied, the above process stops. Otherwise, the process is started again from Step 3.
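Steps 1 to 5 can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard context-based fuzzy c-means updates (membership-weighted centers, memberships of each data point summing to its inclusion degree), crisp intervals so that the inclusion degree is 1 inside the interval and 0 outside, and a stopping rule based on the largest membership change. The names `context_fcm`, `ifcm`, `tol`, and `max_iter` are hypothetical.

```python
import math
import random


def context_fcm(xs, f, n_clusters, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Context-based FCM sketch: memberships of point k sum to f[k]."""
    rng = random.Random(seed)
    n, dim = len(xs), len(xs[0])
    # Step 2: random memberships, rescaled so that each column sums to f[k].
    u = [[rng.random() for _ in range(n)] for _ in range(n_clusters)]
    for k in range(n):
        col = sum(u[i][k] for i in range(n_clusters))
        for i in range(n_clusters):
            u[i][k] = f[k] * u[i][k] / col
    centers = []
    for _ in range(max_iter):
        # Step 3: cluster centers as membership-weighted means.
        centers = []
        for i in range(n_clusters):
            w = [u[i][k] ** m for k in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[k] * xs[k][d] for k in range(n)) / total for d in range(dim)]
            )
        # Step 4: update the partition matrix.
        change = 0.0
        for k in range(n):
            dists = [max(math.dist(xs[k], c), 1e-12) for c in centers]
            for i in range(n_clusters):
                denom = sum((dists[i] / dj) ** (2.0 / (m - 1.0)) for dj in dists)
                new = f[k] / denom
                change = max(change, abs(new - u[i][k]))
                u[i][k] = new
        # Step 5 (assumed criterion): stop when memberships barely change.
        if change < tol:
            break
    return centers, u


def ifcm(xs, ys, edges, n_clusters, m=2.0):
    """Cluster the inputs once per output interval [edges[t], edges[t + 1]]."""
    result = []
    for t in range(len(edges) - 1):
        lo, hi = edges[t], edges[t + 1]
        last = t == len(edges) - 2
        # Crisp inclusion degree: 1 if the output lies in the interval, else 0.
        f = [1.0 if (lo <= y < hi or (last and y == hi)) else 0.0 for y in ys]
        if sum(f) == 0:
            continue  # no data in this interval
        centers, _ = context_fcm(xs, f, n_clusters, m)
        result.append(((lo, hi), centers))
    return result
```

Calling `ifcm(xs, ys, edges, n_clusters)` returns one list of cluster centers per non-empty interval, so the number of rules is at most the number of intervals times the number of clusters per interval.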
In the IFCM, the numbers of intervals and clusters per interval must be set in advance.
Figure 1 shows the concept of the IFCM clustering method, where the intervals and clusters are estimated by setting the numbers of intervals and clusters to 4 and 3, respectively. As shown in
Figure 1, each interval has a linguistic meaning and each cluster produced within the intervals is represented by a fuzzy if–then rule.
In general, the output space is divided into intervals evenly and without overlap, so that only the spacing of each interval needs to be considered. We define this general partitioning method as Case 1. In this study, the performance of GM is also evaluated with two additional methods: one divides the output space flexibly and without overlap, based on the probability distribution of the data, and the other divides it evenly with overlapping intervals. The interval division methods used in this study are as follows.
First, the flexible, non-overlapping method adjusts the interval lengths according to the probability distribution of the data in the output space. An interval in a region with a large distribution value is short, while one in a region with a small distribution value is long. We define this flexible and non-overlapping partitioning as Case 2. Second, the even, overlapping method is similar to the general method, but differs in that the ends of adjacent intervals overlap. Allowing each interval to overlap its neighbors over a certain range makes it possible to find additional similar features. We define this even and overlapping partitioning as Case 3.
Figure 2 shows a conceptual diagram of Cases 1, 2, and 3 according to the method of interval division.
Figure 2a shows Case 1, which divides the output space uniformly, and
Figure 2b shows Case 2, which flexibly divides the output space based on a stochastic distribution. The second interval of
Figure 2b is observed to be short, because the distribution value is large.
Figure 2c shows Case 3, which divides the output space evenly with adjacent intervals overlapping over a certain range.
By changing the granularity of the intervals and their distribution in the output space, we can adjust the width of the linguistic intervals. The adjusted intervals can be helpful in further enhancing the granular model [
16].
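The three partitioning cases can be illustrated with a short sketch. The text does not specify how the probability distribution is turned into interval lengths or how large the overlap is, so the code below makes two assumptions: Case 2 is realized as an equal-frequency (quantile) split, which gives short intervals where the data are dense, and Case 3 extends each equal-width interval by a fraction `overlap` of its width. All function and parameter names are hypothetical.

```python
def uniform_intervals(ys, n_intervals):
    """Case 1: equal-width intervals without overlap."""
    lo, hi = min(ys), max(ys)
    width = (hi - lo) / n_intervals
    return [(lo + t * width, lo + (t + 1) * width) for t in range(n_intervals)]


def quantile_intervals(ys, n_intervals):
    """Case 2 (assumed realization): equal-frequency intervals without overlap."""
    s = sorted(ys)
    n = len(s)
    # Cut points are taken at equally spaced ranks of the sorted outputs.
    cuts = [s[(t * n) // n_intervals] for t in range(n_intervals)] + [s[-1]]
    return list(zip(cuts[:-1], cuts[1:]))


def overlapping_intervals(ys, n_intervals, overlap=0.1):
    """Case 3: equal-width intervals whose ends overlap their neighbors."""
    lo, hi = min(ys), max(ys)
    out = []
    for a, b in uniform_intervals(ys, n_intervals):
        pad = overlap * (b - a)
        # Clip to the observed output range so the outer ends do not grow.
        out.append((max(lo, a - pad), min(hi, b + pad)))
    return out
```

With outputs concentrated around one value, `quantile_intervals` produces a short interval in the dense region and long intervals elsewhere, which matches the behavior described for Case 2.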
2.2. Interval-Based GM
GM is simply a web of associations between the constructed information granules; this model is inherently granular.
Figure 3 shows the structure of GM, which comprises an input space, an output space, and three layers. Given the numerical input values, the model returns some information granules, especially some linguistic intervals, as shown in
Figure 3.
The features of GM include the following. First, a set of information granules is generated in the input space as well as in the output space. Second, the output of GM is expressed as an information granule rather than a numerical value, and has the shape of an interval. The GM's output Y is computed as a fuzzy number as follows
Figure 4 shows GM's final and actual output values. The GM's output has an interval shape, and the prediction performance can be validated by checking whether the actual output is included in the final output of GM. The GM's output comprises the limit values as follows
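As a rough illustration of an interval-valued output, the sketch below assumes that each cluster carries the bounds of the output interval it was generated in, that the activation levels of all clusters are computed with the usual fuzzy c-means membership formula and sum to 1, and that the lower and upper limits of the output are the activation-weighted sums of the interval bounds. These are assumptions for illustration only; the names `gm_output` and `covers` are hypothetical.

```python
import math


def gm_output(x, rules, m=2.0):
    """Interval output for input x.

    rules: list of ((lower, upper), center) pairs, one per cluster.
    """
    dists = [max(math.dist(x, center), 1e-12) for _, center in rules]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    acts = [v / total for v in inv]  # activation levels, summing to 1
    lower = sum(a * lo for a, ((lo, _), _) in zip(acts, rules))
    upper = sum(a * hi for a, ((_, hi), _) in zip(acts, rules))
    return lower, upper


def covers(interval, y):
    """Check whether the actual output y lies inside the predicted interval."""
    lo, hi = interval
    return lo <= y <= hi
```

An input that lies exactly between two cluster centers receives equal activations, so its output interval is the average of the two intervals attached to those clusters.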
4. Performance Evaluation Method
Performance evaluation plays an important role in assessing the accuracy and clarity of the proposed model, and various methods have been developed for this purpose. Common performance evaluation methods include RMSE and MAPE. The RMSE method evaluates performance by subtracting the model's predicted value from the actual output value, calculating the mean of the squared differences, and then taking the square root. The MAPE method evaluates performance by subtracting the model's predicted value from the actual output value, dividing the difference by the actual output value, and averaging the absolute values as a percentage. Thus, these methods evaluate performance using numerical model predictions.
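For reference, the two numerical measures can be written in a few lines under their standard definitions. The function names are illustrative, and MAPE as written here assumes that no actual value is zero.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, relative to the actual values."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n
```

Both functions take point predictions, which is why they cannot be applied directly to a model whose output is an interval.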
However, it is difficult to evaluate the proposed model with these general methods because the output of the interval-based IGM is a fuzzy interval rather than a numerical value. Therefore, in this study, the model is evaluated using the PI, which is a performance evaluation method suitable for granular models.
PI
In this study, the PI method evaluates the prediction capability of interval-based IGM. This method evaluates performance by using the properties of coverage and specificity. The initial PI method was proposed by Pedrycz, and since then, various PI methods have been proposed by Hu [
27], Galaviz [
28], and Zhu [
29].
Figure 7 and
Figure 8 show the characteristics of coverage and specificity, respectively.
Coverage indicates whether the actual output falls within the range of the GM output. If the actual output value lies between the lower and upper bounds of the GM's output, a value close to 1 is given. Conversely, if the actual output value does not fall within the GM output range, a value close to 0 is given. In other words, coverage checks whether the actual output is included within the interval-shaped fuzzy number that forms the GM output. The larger the coverage, the better the GM performance.
Specificity indicates how fine the GM output is, based on the distance from the lower bound to the upper bound of the GM output. If this distance is short, the specificity takes a high value, and if the distance is long, it takes a low value. The larger the specificity, the better the GM performance.
Figure 9 shows the relation between coverage and specificity, where the PI values form a curve. The two values are in a tradeoff: the specificity decreases when the coverage increases, and vice versa. Therefore, it is important to balance the two values without favoring either one.
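The tradeoff can be made concrete with a small sketch. The exact formulas are not given in this text, so the code assumes one common formulation: coverage is the fraction of actual outputs that fall inside their predicted intervals, specificity is the average of one minus the interval width relative to the output range, and the PI is their product. The function names and the `y_range` parameter are hypothetical.

```python
def coverage(targets, intervals):
    """Fraction of actual outputs that lie inside the predicted interval."""
    hits = sum(1 for y, (lo, hi) in zip(targets, intervals) if lo <= y <= hi)
    return hits / len(targets)


def specificity(intervals, y_range):
    """Average narrowness of the intervals relative to the output range."""
    vals = [max(0.0, 1.0 - (hi - lo) / y_range) for lo, hi in intervals]
    return sum(vals) / len(vals)


def performance_index(targets, intervals, y_range):
    """Assumed PI: product of coverage and specificity."""
    return coverage(targets, intervals) * specificity(intervals, y_range)
```

Under these assumptions, widening every interval to the full output range drives coverage to 1 but specificity, and therefore the PI, to 0, while very narrow intervals do the opposite.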
5. Experiment and Results Analysis
In this experiment, a concrete compressive strength (CCS) dataset was used to predict the CCS to compare and analyze the prediction performance of interval-based IGM (Cases 1, 2, and 3).
5.1. Database
In this experiment, the CCS dataset [
44], which comprises data on the CCS of high-performance concrete (HPC), was used to compare and analyze the prediction performance of interval-based IGM. This dataset consists of eight input variables and one output variable, the CCS.
The ASN (Airfoil Self-Noise) data set contains measurements of the noise generated by a NACA 0012 airfoil in a wind tunnel at various speeds and angles of attack. The ASN data set includes 5 input variables and 1 output variable. The input variables are frequency, angle of attack, chord length in meters, free-stream velocity, and suction side displacement thickness, and the output variable is the sound pressure level in decibels.
The MPG (miles per gallon) data set contains data for predicting the fuel consumption of a vehicle. The MPG data set includes 7 input variables and 1 output variable. The input variables are the number of cylinders, displacement, horsepower, weight, acceleration, model year, and model name, and the output variable is the vehicle fuel consumption. The data were divided into training and validation sets of equal size and normalized to values between 0 and 1 to obtain more accurate results.
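The preprocessing described above can be sketched as follows. The text does not state whether the halves were chosen randomly or in order, so the split below simply takes the first and second halves; the function names are hypothetical.

```python
def min_max_normalize(column):
    """Scale a list of values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in column]


def split_half(rows):
    """Split rows into two halves (assumed: in order, without shuffling)."""
    half = len(rows) // 2
    return rows[:half], rows[half:]
```

Each input and output variable would be normalized column by column before the rows are split.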
5.2. Experimental Method and Results
In interval-based IGM, the predictive performance of Cases 1, 2, and 3, which differ in how the output space is divided into intervals, is compared and analyzed through the PI approach. The performance of the interval-based IGM was verified through the PI method using the properties of coverage and specificity described in
Section 4. The experiments were performed with the number of intervals in the interval-based IGM increased from 2 to 10. In addition, the number of clusters per interval was increased from 2 to 10, and the weighting exponent was fixed at 2. The experiments were performed for Cases 1, 2, and 3.
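The experimental procedure amounts to a grid search over the numbers of intervals and clusters per interval. The sketch below shows only the search loop; `evaluate_pi` is a hypothetical callable standing in for training an interval-based IGM with the given settings and returning its PI, and the toy scoring function in the usage note is made up for illustration.

```python
def grid_search(evaluate_pi, interval_counts, cluster_counts):
    """Return (best PI, intervals, clusters per interval, number of rules)."""
    best = None
    for p in interval_counts:
        for c in cluster_counts:
            score = evaluate_pi(p, c)
            if best is None or score > best[0]:
                # One rule per cluster, so rules = intervals * clusters.
                best = (score, p, c, p * c)
    return best
```

For example, with a toy score that peaks at 7 intervals and 10 clusters, `grid_search(lambda p, c: -(p - 7) ** 2 - (c - 10) ** 2, range(2, 11), range(2, 11))` selects that setting and reports 70 rules.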
Table 1 and
Table 2 show the experimental results for Case 1, which is a method of equally dividing the interval without any overlapping. For Case 1, the best result, a PI of approximately 0.39, was obtained with 7 intervals and 10 clusters in each interval.
Figure 10 shows the output value for Case 1 and the actual output value, and
Figure 11 shows the PI value for Case 1 in the form of a mesh.
Table 3 and
Table 4 show the experimental results for Case 2, which is a method of flexibly dividing the intervals according to the probability distribution without overlapping. For Case 2, the best result, a PI of approximately 0.49, was obtained with 7 intervals and 10 clusters in each interval.
Figure 12 shows the output value for Case 2 and the actual output value, and
Figure 13 shows the PI value for Case 2 in the form of a mesh.
Table 5 and
Table 6 show the experimental results for Case 3, which is a method of evenly dividing the intervals with overlapping. For Case 3, the best result, a PI of approximately 0.44, was obtained with 5 intervals and 10 clusters in each interval.
Figure 14 visualizes the output value for Case 3 and the actual output value, and
Figure 15 shows the PI value for Case 3 in the form of a mesh.
Table 7 lists the best prediction results obtained for each interval-based IGM. Cases 1 and 2 performed best when the numbers of intervals and clusters were 7 and 10, respectively, which generated 70 rules. Case 3 performed best when the numbers of intervals and clusters were 5 and 10, respectively, which generated 50 rules. According to the experimental results, overlapping intervals performed better than evenly divided non-overlapping intervals, and flexible division performed better than equal division.
Next, we examine the overfitting problem in the construction of the interval-based IGM. For the CCS data set, the number of intervals and the number of clusters per interval were each increased from 2 to 50. As a result, it was confirmed that overfitting occurs when the number of intervals is large.
Figure 16 and
Figure 17 show the variation of the performance index as the numbers of intervals and clusters per interval increase, for the training and testing data sets, respectively. As shown in
Figure 17, the performance index decreases when the number of intervals exceeds 7. As the number of clusters per interval increases, the performance index increases slightly. We will study the optimal allocation of clusters in future research.
6. Conclusions
In this study, the predictive performance of interval-based IGM under different methods of partitioning the output space into intervals was compared and analyzed using the PI method. According to the experimental results, Case 1 performed best with seven intervals and 10 clusters in each interval, giving a PI value of approximately 0.39. Case 2 performed best with seven intervals and 10 clusters in each interval, giving a PI value of approximately 0.49. Finally, Case 3 performed best with five intervals and 10 clusters in each interval, giving a PI value of approximately 0.44. Thus, Cases 1 and 2 achieved their best prediction results when 70 rules were generated, and Case 3 when 50 rules were generated. These results confirm that the numbers of intervals and clusters should be suited to the data rather than simply made large. In the future, segmentation methods other than Cases 1, 2, and 3 and other performance evaluation methods will be studied.