### *2.2. Methods*

### 2.2.1. Fuzzy Cognitive Maps Overview

A fuzzy cognitive map (FCM) is a directed graph in which nodes denote concepts important for the analyzed problem, and links represent the causal relationships between concepts [71]. It is an effective tool for modeling decision support systems. FCMs have been applied in many research domains, e.g., in business performance analysis [72], strategy planning [73], modeling virtual worlds [74], time series prediction [69], and adoption of educational software [75].

The FCM model can be used to perform simulations by utilizing its dynamic model. The values of the concepts change in time as simulation goes on [68]. The new values of the concepts can be calculated based on the popular dynamic model described as follows [59]:

$$X\_i(t+1) = F\left(X\_i(t) + \sum\_{\substack{j=1 \\ j \neq i}}^{n} w\_{j,i} X\_j(t)\right) \tag{1}$$

where *Xi*(*t*) is the value of the *i*th concept at the *t*th iteration, *wj*,*i* is the weight of the connection (relationship) between the *j*th concept and the *i*th concept, *t* is discrete time, *i*, *j* = 1, 2, ... , *n*, *n* is the number of concepts, and *F*(*x*) is the sigmoidal transformation function [58]:

$$F(\mathbf{x}) = \frac{1}{1 + e^{-c\mathbf{x}}} \tag{2}$$

where *c* is a parameter, *c* > 0. The weights of the relationships show how causal concepts affect one another. If *wj*,*i* > 0, then an increase/decrease in the value of the *j*th concept will increase/decrease the value of the *i*th concept. If *wj*,*i* < 0, then an increase/decrease in the value of the *j*th concept will decrease/increase the value of the *i*th concept. If *wj*,*i* = 0, there is no causal relationship between the *j*th and the *i*th concepts [74].
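The dynamic model of Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's C#-based ISEMK implementation; the parameter *c* = 5 and the three-concept weight matrix below are assumed for demonstration only:

```python
import math

def fcm_step(x, W, c=5.0):
    """One iteration of the FCM dynamic model (Eqs. 1-2).

    x : list of current concept values X_i(t)
    W : weight matrix, W[j][i] = w_{j,i} (influence of concept j on i)
    c : sigmoid steepness parameter (c > 0); c = 5.0 is an illustrative choice
    """
    n = len(x)
    new_x = []
    for i in range(n):
        # X_i(t) plus the weighted influence of all other concepts (Eq. 1)
        s = x[i] + sum(W[j][i] * x[j] for j in range(n) if j != i)
        # sigmoidal transformation F (Eq. 2)
        new_x.append(1.0 / (1.0 + math.exp(-c * s)))
    return new_x

# Hypothetical example: three concepts, one positive and one negative causal link
W = [[0.0, 0.6, 0.0],
     [0.0, 0.0, -0.4],
     [0.0, 0.0, 0.0]]
x = [0.5, 0.5, 0.5]
for _ in range(10):  # run the simulation for ten iterations
    x = fcm_step(x, W)
```

Because *F* is the sigmoid, all concept values remain in (0, 1) throughout the simulation.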

The FCM structure is often constructed based on expert knowledge or surveys [74]. We could also use machine learning algorithms and available historical data to construct the FCM model and determine the weights of the relationships between the FCM's concepts.

### 2.2.2. Fuzzy Cognitive Maps Evolutionary Learning

Evolutionary algorithms are popular techniques for FCM learning. In this paper, we explored two effective techniques: the real-coded genetic algorithm (RCGA) [68] and the structure optimization genetic algorithm (SOGA) [69].

### Real-Coded Genetic Algorithm (RCGA)

The RCGA algorithm defines an individual in the population as follows [24]:

$$W' = \left[ w\_{1,2}, w\_{1,3}, w\_{1,4}, \dots, w\_{j,i}, \dots, w\_{n,n-1} \right]^T \tag{3}$$

where *wj*,*<sup>i</sup>* is the weight of the relationship between the *j*th concept and the *i*th concept.

Each individual in the population is evaluated with the use of a fitness function based on the data error [66]:

$$fitness\_p(MSE\_{tr}(l)) = \frac{1}{aMSE\_{tr}(l) + 1} \tag{4}$$

where *a* is a parameter, *l* is the generation number, *l* = 1, ... , *L*, *L* is the maximum number of generations, *p* is the index of the individual, *p* = 1, ... , *P*, *P* is the population size, and *MSEtr*(*l*) is the data error, described as follows:

$$MSE\_{tr}(l) = \frac{1}{N\_{tr}} \sum\_{t=1}^{N\_{tr}} e\_t^2 \tag{5}$$

where *t* = 1, ... ,*Ntr*, *Ntr* is the number of training records, and *et* is the one-step-ahead prediction error at the *t*th iteration, described as follows:

$$e\_t = Z(t) - X(t) \tag{6}$$

where *X(t)* is the predicted value of the output concept, and *Z(t)* is the desired value of the output concept.
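As an illustration, the error and fitness computations of Equations (4)–(6) can be sketched in Python (the paper's own implementation is the C#-based ISEMK tool). The parameter value *a* = 100 and the sample data below are assumptions for demonstration:

```python
def mse_tr(Z, X):
    """Training error, Eq. (5): mean of squared one-step-ahead errors e_t (Eq. 6)."""
    return sum((z - x) ** 2 for z, x in zip(Z, X)) / len(Z)

def fitness(mse, a=100.0):
    """Fitness of an individual, Eq. (4); a = 100 is an illustrative parameter."""
    return 1.0 / (a * mse + 1.0)

Z = [0.3, 0.5, 0.7]    # desired output-concept values (made up)
X = [0.25, 0.55, 0.7]  # predicted output-concept values (made up)
f = fitness(mse_tr(Z, X))
```

A lower data error yields a fitness value closer to 1, which is what the stopping condition (7) compares against the *fitnessmax* parameter.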

The RCGA stops when the maximum number of generations *L* is reached or when condition (7) is met, which indicates that the learning process was successful.

$$fitness\_p(MSE\_{tr}(l)) > fitness\_{max} \tag{7}$$

where *fitnessp*(*MSEtr*(*l*)) is the fitness function value for the best individual, and *fitnessmax* is a parameter.

### Structure Optimization Genetic Algorithm (SOGA)

The SOGA algorithm is an extension of the RCGA algorithm [65,66] that allows the decision-maker to determine the most significant concepts and the relationships between them.

Each individual is evaluated with a fitness function based on a new data error measure, described as follows [66]:

$$MSE\_{tr}'(l) = MSE\_{tr}(l) + b\_1 \frac{n\_r}{n^2} MSE\_{tr}(l) + b\_2 \frac{n\_c}{n} MSE\_{tr}(l) \tag{8}$$

where *b*1, *b*2 are the parameters of the fitness function, *nr* is the number of non-zero relationships, *nc* is the number of concepts in the analyzed model, *n* is the number of all possible concepts, *l* is the generation number, *l* = 1, ... , *L*, and *L* is the maximum number of generations.

The fitness function (9) then evaluates the quality of each individual in the population:

$$fitness\_p(MSE'\_{tr}(l)) = \frac{1}{aMSE'\_{tr}(l) + 1} \tag{9}$$

where *a* is an experimentally defined parameter, *p* is the index of the individual, *p* = 1, ... , *P*, *P* is the population size, and *MSE'tr*(*l*) is the new error measure.
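A minimal sketch of the penalized error of Equation (8), assuming illustrative parameter values *b*1 = *b*2 = 0.1; it shows how sparser FCM structures incur a smaller penalty:

```python
def soga_error(mse, n_r, n_c, n, b1=0.1, b2=0.1):
    """Penalized error MSE'_tr of Eq. (8): adds structure-complexity terms.

    n_r : number of non-zero relationships
    n_c : number of concepts used in the analyzed model
    n   : number of all possible concepts
    b1, b2 : fitness-function parameters (0.1 is an illustrative setting)
    """
    return mse + b1 * (n_r / n**2) * mse + b2 * (n_c / n) * mse

# A sparser model (fewer concepts and relationships) is penalized less:
dense  = soga_error(0.01, n_r=20, n_c=5, n=5)
sparse = soga_error(0.01, n_r=6, n_c=3, n=5)
```

For the same raw *MSEtr*, the sparse structure yields a smaller *MSE'tr* and hence a higher fitness under Equation (9), which is what drives SOGA toward less complex models.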

We could construct a less complex time series prediction model by removing the redundant concepts and connections between them with the use of a binary vector *C* and the proposed error function.

The algorithmic steps of the learning and analysis of the FCM in modeling prediction systems with the use of population-based algorithms (SOGA and RCGA) were analytically presented in [69].

For our experiments, the evolutionary operators of a) ranking selection, b) uniform crossover, and c) random mutation were used [76,77]. In addition, we applied an elite strategy, and probabilities of crossover *Pc* and mutation *Pm* were assigned to each population.

### 2.2.3. Artificial Neural Networks

An artificial neural network (ANN) is a collection of artificial neurons organized in the form of layers [25]. Neurons are connected by weighted connections to form the network. The ANNs most widely used in time series prediction are multilayer perceptrons with an input layer, an output layer, and a single hidden layer between them. The most common structures use one or two hidden layers, as a feed-forward neural network with one hidden layer is able to approximate any continuous function. Supervised learning algorithms and historical data can be used for the learning process of ANNs. The output of each neuron can be calculated based on the following formula:

$$X(t) = F \left( \sum\_{j=1}^{m} X\_j(t) \cdot w\_j + b \right) \tag{10}$$

where *Xj*(*t*) is the value of the *j*th input signal, *t* = 1, ... , *Ntr*, *Ntr* is the number of training records, *wj* is the synaptic weight, *m* is the number of input signals, *b* is the bias, and *F* is the sigmoid activation function. Training a neural network requires determining the values of the connection weights and the biases of the neurons. There are many neural network learning algorithms. The most popular algorithm for ANN learning is the back-propagation method. In this learning method, the weights are updated according to the learning records until one epoch (a pass over the entire learning dataset) is completed. This method aims to minimize the error function, described as follows [14,78,79]:

$$MSE\_{tr}(l) = \frac{1}{2N\_{tr}} \sum\_{t=1}^{N\_{tr}} e\_t^2 \tag{11}$$

where *t* = 1, ... ,*Ntr*, *Ntr* is the number of training records, *l* is the number of epoch, *l* = 1, ... ,*L*, *L* is the maximum number of epochs, and *et* is the one-step-ahead prediction error at the *t*th iteration, which is equal to:

$$e\_t = Z(t) - X(t) \tag{12}$$

where *X(t)* is the output value of the ANN, and *Z(t)* is the desired value. The modification of the weights in the back-propagation algorithm can be calculated by the formula:

$$\Delta w\_{kj}(l) = -\gamma \frac{\partial J(l)}{\partial w\_{kj}(l)} \tag{13}$$

where Δ*wkj*(*l*) is the change of the weight *wkj* at the *l*th epoch, γ is a learning coefficient, and *J*(*l*) is the error function (11).

Backpropagation algorithm with momentum modifies the weights according to the formula:

$$\Delta w\_{kj}(l) = -\gamma \frac{\partial J(l)}{\partial w\_{kj}(l)} + \alpha \Delta w\_{kj}(l-1) \tag{14}$$

where α is a momentum parameter.
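The weight updates of Equations (13) and (14) can be sketched as follows; γ = 0.1 and α = 0.9 are illustrative values, not values reported in the paper:

```python
def momentum_update(w, grad, prev_delta, gamma=0.1, alpha=0.9):
    """Weight update of Eq. (14): Δw(l) = -γ·∂J(l)/∂w(l) + α·Δw(l-1).

    gamma (learning coefficient) and alpha (momentum parameter) are
    illustrative settings. Setting alpha = 0 recovers plain back-propagation
    as in Eq. (13). Returns the new weight and the delta carried to the
    next epoch.
    """
    delta = -gamma * grad + alpha * prev_delta
    return w + delta, delta

# Hypothetical gradients of the error function J over three epochs:
w, prev = 0.5, 0.0
for grad in [0.2, 0.15, 0.1]:
    w, prev = momentum_update(w, grad, prev)
```

The momentum term accumulates a fraction of the previous update, which smooths the trajectory of the weights across epochs.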

### 2.2.4. Hybrid Approach Based on FCMs, SOGA, and ANNs

The hybrid approach for time series prediction is based on FCMs, the SOGA algorithm, and ANNs [68]. This approach consists of two stages: (1) selection of the most significant concepts of the FCM model with the use of the SOGA algorithm, and (2) use of the selected concepts as inputs to an ANN model.
This hybrid structure allows the decision-maker to select the most significant concepts for an FCM model using the SOGA algorithm. These concepts are used as inputs for the ANN model. Such a hybrid approach aims to find the most accurate model for time series prediction problems.

### 2.2.5. The Ensemble Forecasting Method

The most intuitive and popular way of forecast aggregation is to linearly combine the constituent forecasts [80]. There are various methods proposed in the literature for selecting the combining weights [81]. The most popular and widely used ensemble methods are the error-based and the simple average [82]. The easiest among them is the simple average in which all forecasts are weighted equally, often remarkably improving overall forecasting accuracy [82,83].

Considering that Y = [*y*1, *y*2, *y*3, ... , *yN*]T is the actual out-of-sample testing dataset of a time series and Yˆ*i* = [*y*ˆ*i*1, *y*ˆ*i*2, ... , *y*ˆ*iN*]T is the forecast of the *i*th model, the linear combination of *n* forecasts is produced by [15]:

$$\hat{y}\_k = w\_1 \hat{y}\_k^{(1)} + w\_2 \hat{y}\_k^{(2)} + \dots + w\_n \hat{y}\_k^{(n)} = \sum\_{i=1}^{n} w\_i \hat{y}\_k^{(i)}, \ \forall k = 1, 2, \dots, N \tag{15}$$

Here, our analysis is based on these most popular ensemble methods. A brief discussion follows for each one.

• The *simple average (AVG)* method [82] is an unambiguous technique, which assigns the same weight to every single forecast. Based on empirical studies in the literature, the AVG method is robust and able to generate reliable predictions, and it can be characterized as remarkably accurate and impartial. Applied across several models, the AVG improved the average accuracy as the number of combined single methods increased [82]. Comparing this method with weighted combination techniques in terms of forecasting performance, the researchers in [84] concluded that a simple average combination might be more robust than weighted average combinations. In the simple average combination, the weights can be specified as follows:

$$w\_i = \frac{1}{n}, \quad \forall i = 1, 2, \dots, n \tag{16}$$

• The *error-based (EB)* method [16] assigns to the component forecasts weights that are inversely proportional to their in-sample forecasting errors. For instance, researchers may give a higher weight to a model with a lower error, while assigning a lower weight to a model that presents a larger error. In most cases, the forecasting error is calculated using a total squared or absolute error statistic, such as the sum of squared errors (SSE) [80,83]. The combining weight for an individual prediction is mathematically given by:

$$w\_i = e\_i^{-1} / \sum\_{i=1}^n e\_i^{-1}, \ \forall i = 1, \ 2, \ \dots, n \tag{17}$$
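Both combination schemes, Equations (15)–(17), can be sketched in Python; the component forecasts and in-sample errors below are made up for illustration:

```python
def avg_weights(n):
    """Simple-average weights, Eq. (16): every forecast weighted equally."""
    return [1.0 / n] * n

def eb_weights(errors):
    """Error-based weights, Eq. (17): inversely proportional to in-sample error."""
    inv = [1.0 / e for e in errors]
    total = sum(inv)
    return [v / total for v in inv]

def combine(forecasts, weights):
    """Linear combination of component forecasts, Eq. (15)."""
    return [sum(w * f[k] for w, f in zip(weights, forecasts))
            for k in range(len(forecasts[0]))]

# Two hypothetical component forecasts with in-sample SSEs of 0.02 and 0.04:
f1, f2 = [0.4, 0.6], [0.6, 0.8]
w = eb_weights([0.02, 0.04])   # the lower-error model gets the larger weight
y = combine([f1, f2], w)
```

With errors 0.02 and 0.04, the EB weights come out as 2/3 and 1/3, so the combined forecast leans toward the more accurate model; `avg_weights(2)` would instead give both models a weight of 0.5.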

### **3. The Proposed Forecast Combination Methodology**

In the rest of the paper, we explore a new advanced forecasting approach, introducing a different split of the dataset for daily, weekly, or monthly forecasting, as well as a combination of forecasts from multiple structurally different models, such as ANNs and FCMs with various efficient learning algorithms and hybrid configurations of them. In addition, the two most popular ensemble methods, the AVG and the EB methods, were applied to the ensemble forecasts to improve prediction accuracy.

In the described ensemble scheme, the selection of the appropriate validation set, i.e., the selection of the parameter *Nvd* and the group size *Ntr*, is very important. The validation set should reflect the characteristics of the testing dataset that is practically unknown in advance. As such, in this study, we set the following process of data split. The data split takes place by removing 15% of the total dataset *N* and saving for later use as testing data. The remaining 85% of the dataset is then split again into an 82/18 ratio, resulting in the following portions: 70% for training and 15% for validation. Also, the group size *Ntr* (i.e., the training data) should be appropriately selected so that it is neither too small nor too large.
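The described split (15% held out for testing, then an 82/18 split of the remainder, i.e., roughly 70/15/15 overall) can be sketched as follows; `split_dataset` is a hypothetical helper, not part of the ISEMK tool:

```python
def split_dataset(y, test_frac=0.15, val_frac_of_rest=0.18):
    """Split a time series per the described scheme: the final 15% is saved
    for testing, and the remaining 85% is split again 82/18 into training
    and validation, preserving temporal order.
    """
    n = len(y)
    n_ts = round(n * test_frac)
    rest = y[: n - n_ts]                       # in-sample portion (85%)
    n_vd = round(len(rest) * val_frac_of_rest)
    y_tr = rest[: len(rest) - n_vd]            # ~70% of the total
    y_vd = rest[len(rest) - n_vd:]             # ~15% of the total
    y_ts = y[n - n_ts:]                        # 15% of the total
    return y_tr, y_vd, y_ts

y = list(range(1000))              # a stand-in series of 1000 records
tr, vd, ts = split_dataset(y)
```

Keeping the split ordered (rather than shuffled) matters here, since the validation set should mimic the yet-unseen testing period of the time series.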

Due to the problem nature, as we work with time-series data, the most efficient method for resampling is the boosting/bootstrapping method [85]. In boosting, resampling is strategically geared to provide the most informative training data for each consecutive predictor. Therefore, in this study, an appropriate bootstrapping method was applied, so that the training dataset keeps the same size at each resampling set, and the validation and testing sets keep the same size (after excluding the *k*-values from the in-sample dataset).

The proposed effective forecast combination methodology for time series forecasting, presented in this paper, includes three main processing steps: data pre-processing to handle missing values, normalize the collected time-series data, and split the dataset; forecasting with ANNs, RCGA-FCMs, SOGA-FCMs, and the hybrid SOGA FCM-ANN, together with their ensembles; and evaluation of the prediction results using the two most popular ensemble methods, the simple average (AVG) and the error-based (EB). Figure 1 visually illustrates the suggested methodology.

In the followed approach, data preprocessing included outlier detection and removal, handling of missing data, and data normalization, all in accordance with established Data Science practices described in the corresponding literature. For outlier detection, the Z-score was first calculated for each sample in the dataset (using the standard deviation value presented in the descriptive statistics Tables A1 and A2 in Appendix A). Then, a threshold was specified, and the data points that lay beyond this threshold were classified as outliers and removed. Mean imputation was performed to handle missing values. Specifically, for numerical features, missing values were replaced by the mean feature value.

**Figure 1.** Flowchart of the proposed forecasting combination approach.

Each dataset was normalized to [0,1] before the forecasting models were applied. During the testing phase, the normalized datasets were mapped back to their original values. The data normalization was carried out mathematically as follows:

$$y\_i^{(new)} = \frac{y\_i - y^{(min)}}{y^{(max)} - y^{(min)}}, \forall i = 1, 2, \dots, N \tag{18}$$

where Y = [*y*1, *y*2, *y*3, ... , *yNtrain*]T is the training dataset, Y(*new*) = [*y*(*new*)1, *y*(*new*)2, ... , *y*(*new*)*N*]T is the normalized dataset, and *y*(*min*) and *y*(*max*) are, respectively, the minimum and maximum values of the training dataset Y.

We selected the Min-Max normalization method [86] as it is one of the most popular and comprehensible methods, in terms of performance of the examined systems, while several researchers showed that it produces better (if not equally good) results with high accuracy, compared to the other normalization methods [87,88]. In [88], the Min-Max was valued as the second-best normalization method in the backpropagation NN model, justifying our choice to deploy this method for data normalization. Moreover, since the FCM concepts use values within the range [0,1] for the conducted simulations and do not deal with real values, the selected method seemed to be proper for our study. Also, this normalization approach was previously used in [66,69].
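A minimal sketch of the Min-Max normalization of Equation (18) and its inverse mapping used in the testing phase; the helper names are illustrative:

```python
def min_max_normalize(y_train, y):
    """Min-Max normalization, Eq. (18), using the training set's min/max
    so that unseen data are scaled consistently with the training data."""
    lo, hi = min(y_train), max(y_train)
    return [(v - lo) / (hi - lo) for v in y]

def denormalize(y_train, y_norm):
    """Map normalized values back to the original scale for the testing phase."""
    lo, hi = min(y_train), max(y_train)
    return [v * (hi - lo) + lo for v in y_norm]

y_tr = [10.0, 20.0, 30.0, 40.0]          # made-up training values
y_n = min_max_normalize(y_tr, y_tr)      # all values now lie in [0, 1]
```

Using the *training* minimum and maximum for both directions keeps the transformation consistent and, as noted above, matches the [0,1] range that FCM concepts operate in.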

Due to our intention to suggest a generic forecasting combination approach (with ANNs, FCMs, and their hybrid structures) able to be applied to any time series dataset, the following steps are thoroughly presented and executed.

**Step 1. (Split Dataset)** We divided the original time series Y = [*y*1, *y*2, *y*3, ... , *yN*]T into the in-sample training dataset Ytr = [*y*1, *y*2, *y*3, ... , *yNtr*]T, the in-sample validation dataset Yvd = [*yNtr*+1, *yNtr*+2, *yNtr*+3, ... , *yNtr*+*Nvd*]T, and the out-of-sample testing dataset Yts = [*yNin*+1, *yNin*+2, *yNin*+3, ... , *yNin*+*Nts*]T, so that *Nin* = *Ntr* + *Nvd* is the size of the total in-sample dataset and *Nin* + *Nts* = *N*, where *N* is the number of days, weeks, or months, according to the short- or long-term prediction horizon of the time series.

**Step 2.** (**Resampling method**/**Bootstrapping**). We considered *k* training sets drawn from the whole dataset each time. For example, in monthly forecasting, we excluded one month at a time from the initial in-sample dataset, starting from the first month of the time series and proceeding month by month until *k* = 12 (i.e., 1 to 12 months were excluded from the initial in-sample dataset). Therefore, *k* subsets of training data were created and used for training. The remaining values of the in-sample dataset were used for validation, whereas the testing set remained the same. Figure 2 shows an example of this bootstrapping method for the ensemble SOGA-FCM approach. In particular, Figure 2a represents the individual forecasters' prediction values and their average error calculation, whereas, in Figure 2b, the proposed forecasting combination approach for SOGA-FCM is depicted for both ensemble methods.
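One possible reading of this resampling step can be sketched as follows, assuming daily data grouped into 30-value "months"; `bootstrap_subsets` is a hypothetical helper and the block size is an assumption for illustration:

```python
def bootstrap_subsets(in_sample, block, k):
    """Create k training subsets by excluding one block (e.g., one month of
    daily values, block=30) at a time from the in-sample data, per the
    described resampling scheme. Each subset keeps the same size; the
    excluded block is available for validation.
    """
    subsets = []
    for i in range(k):
        start, end = i * block, (i + 1) * block
        subsets.append(in_sample[:start] + in_sample[end:])
    return subsets

data = list(range(360))                     # e.g., 12 "months" of 30 values
subs = bootstrap_subsets(data, block=30, k=12)
```

Every subset has the same length (here 330 values), satisfying the requirement that the training dataset keep the same size at each resampling.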

**Figure 2.** (**a**) Forecasting approach using individual forecasters of SOGA-FCM and mean average, (**b**) Example of the proposed forecasting combination approach for SOGA-FCM using ensemble methods. SOGA, structure optimization genetic algorithm; FCM, fuzzy cognitive map.

If we needed to accomplish daily forecasting, then we preselected the number of days excluded at each subset *k*. For simplicity (as in the case of monthly forecasting), we could consider that one day was excluded at each subset from the initial in-sample dataset. The overall approach, including ANNs, FCMs, and hybrid configurations of them, is illustrated in Figure 3. In Figure 3, the four ensemble forecasters were produced after the validation process and used for testing through the proposed approach.

**Figure 3.** The proposed forecasting combination approach using ensemble methods and ensemble forecasters.

**Step 3.** We had *n* component forecasting models and obtained Yˆ*i*ts = [*y*ˆ*iNin*+1, *y*ˆ*iNin*+2, ... , *y*ˆ*iNin*+*Nts*]T as the forecast of Yts through the *i*th model.

**Step 4.** We implemented each model on Ytr and used it to predict Yvd. Let Yˆ*i*vd = [*y*ˆ*iNtr*+1, *y*ˆ*iNtr*+2, ... , *y*ˆ*iNtr*+*Nvd*]T be the prediction of Yvd through the *i*th model.

**Step 5.** We found the in-sample forecasting error of each model through some suitable error measures. We used the mean absolute error (MAE) and the mean squared error (MSE). These are widely popular error statistics [68], and their mathematical formulation is presented below in this paper. In the present study, we adopted the MSE and MAE to find the in-sample forecasting errors of the component models.

**Step 6.** Based on the obtained in-sample forecasting errors, we assigned a score to each component model as γ*i* = 1/*MSE*(Yvd, Yˆ*i*vd), ∀*i* = 1, 2, ... , *n*. The scores are assigned to be inversely proportional to the respective errors, so that a model with a comparatively smaller in-sample error receives a higher score and vice versa.

**Step 7.** We assigned a rank *ri* ∈ {1, 2, ... , *n*} to the *i*th model, based on its score, so that *ri* ≥ *rj* if γ*i* ≤ γ*j*, ∀*i*, *j* = 1, 2, ... , *n*. The minimum, i.e., the best, rank is equal to 1, and the maximum, i.e., the worst, rank is at most equal to *n*.

**Step 8.** We chose a number *nr* so that 1 ≤ *nr* ≤ *n* and let *I* = {*i*1, *i*2, ... , *inr*} be the index set of the *nr* component models whose ranks lie in the range [1, *nr*]. Thus, we selected the subgroup of the *nr* smallest-ranked component models.

**Step 9.** Finally, we obtained the weighted linear combination of these selected *nr* component forecasts, as follows:

$$\hat{y}\_k = w\_{i\_1} \hat{y}\_k^{i\_1} + w\_{i\_2} \hat{y}\_k^{i\_2} + \dots + w\_{i\_{n\_r}} \hat{y}\_k^{i\_{n\_r}} = \sum\_{i \in I} w\_i \hat{y}\_k^{i}, \quad \forall k = 1, 2, \dots, N \tag{19}$$

Here, *wik* = γ*ik* / Σ*nr k*=1 γ*ik* is the normalized weight of the selected component model, so that Σ*nr k*=1 *wik* = 1.

**Step 10.** The simple average method could also be adopted, as an alternative to Steps 6–9, to calculate the forecasted value.
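Steps 6–9 can be sketched compactly in Python; the validation errors and component forecasts below are made-up illustrative values:

```python
def score_and_combine(val_errors, forecasts, n_r):
    """Steps 6-9: score each model as 1/MSE (Step 6), rank models by score
    (Step 7), keep the n_r best-ranked models (Step 8), and combine their
    forecasts with normalized score-based weights (Step 9, Eq. 19).
    """
    scores = [1.0 / e for e in val_errors]                  # Step 6
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])  # Step 7
    chosen = ranked[:n_r]                                   # Step 8
    total = sum(scores[i] for i in chosen)
    weights = {i: scores[i] / total for i in chosen}        # weights sum to 1
    return [sum(weights[i] * forecasts[i][k] for i in chosen)
            for k in range(len(forecasts[0]))]              # Step 9

errs = [0.02, 0.05, 0.01, 0.04]   # validation MSEs of four models (made up)
fcs = [[0.5], [0.7], [0.4], [0.6]]
y = score_and_combine(errs, fcs, n_r=2)
```

With these numbers, the two best models (errors 0.01 and 0.02) are kept and weighted 2/3 and 1/3, respectively, pulling the combined forecast toward the more accurate one.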

The validation set was used during the training process for updating the algorithm weights appropriately and, thus, improving performance and avoiding overfitting. After training the model, we could run it on the testing data to verify whether it predicted the data correctly, keeping the validation set separate from this evaluation.

The most popular and widely used performance metrics or evaluation criteria for time series prediction are the following: coefficient of determination (R2), mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The mathematical equations of all these statistical indicators were described in the study [69]. The goodness of fit and the performance of the studied models, when applied to a natural gas prediction process, were evaluated and compared using two of these five commonly used statistical indicators, namely, the MSE and the MAE [9]. In particular, the performance of the analyzed approaches for natural gas prediction was evaluated based on the following criteria:

1. Mean squared error:

$$\text{MSE} = \frac{1}{T} \sum\_{t=1}^{T} \left( Z(t) - X(t) \right)^2 \tag{20}$$

2. Mean absolute error:

$$\text{MAE} = \frac{1}{T} \sum\_{t=1}^{T} \left| Z(t) - X(t) \right| \tag{21}$$

where *X*(*t*) is the predicted value of the natural gas consumption at the *t*th iteration, *Z*(*t*) is the desired value of the natural gas consumption at the *t*th iteration, *t* = 1, ... , *Nts*, and *Nts* is the number of testing records. Lower values of the MSE and MAE indicate that the model performs better with respect to prediction accuracy and that the regression line fits the data well.
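For reference, the two evaluation criteria of Equations (20) and (21) can be sketched as:

```python
def mse(Z, X):
    """Mean squared error, Eq. (20)."""
    return sum((z - x) ** 2 for z, x in zip(Z, X)) / len(Z)

def mae(Z, X):
    """Mean absolute error, Eq. (21)."""
    return sum(abs(z - x) for z, x in zip(Z, X)) / len(Z)

Z = [1.0, 2.0, 3.0]   # desired values (made up)
X = [1.1, 1.9, 3.3]   # predicted values (made up)
```

MSE penalizes large deviations more heavily than MAE, which is why the two criteria can rank models differently on the same testing set.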

All the modeling approaches, tests, and evaluations were performed with the use of the ISEMK (intelligent expert system based on cognitive maps) software tool [66], in which all the algorithms based on ANNs, FCMs, and their hybrid combinations were developed. The C# programming language was used for implementing the ensemble models and also for developing ISEMK, which incorporates FCM construction from data and learning, both for RCGA and SOGA implementations [69].

### **4. Results and Discussion**

### *4.1. Case Study and Datasets*

The natural gas consumption datasets used in this research work to examine the applicability and effectiveness of the proposed forecast methodology corresponded to five years (2013–2017), as described in Section 3. Following the first step of the methodology, we split our dataset into training, validation, and testing sets. For the convenience of handling the dataset properly, we defined the data of the first three years as the training dataset (1095 days), the data of the fourth year as the validation dataset (365 days), and the remaining data (5th year) as the testing dataset (365 days), which approximately corresponded to 60%, 20%, and 20%, respectively, as presented in Section 3. Thus, it was easier for our analysis to handle the above values as annual datasets and have a clearer perception of the whole process.

Out of the three years of the defined training dataset, we used the first two as the initial training dataset, while the third (3rd) year was used as a dataset reservoir for the bootstrapping procedure. This year was properly selected to be part of the initial dataset, as for each value of *k* (the bootstrapping step), a corresponding number of days/weeks/months was additionally needed to be included in the training dataset during the bootstrapping process, thus, avoiding any possible data shortage and/or deterioration that would lead to inaccurate results.

The proposed forecast combination approach, presented in Section 3, offered generalization capabilities and, thus, could be applied to various time-series datasets for a different number of *k*, according to daily, weekly, or monthly prediction. Taking as an example the case of a month-ahead prediction, for each bootstrapping step *k*, the training dataset shifted one month ahead, getting one additional month each time from the reserved third year of the initial training dataset. In this case, *k* more months in total were needed for implementing this approach efficiently. If we considered *k* = 12, then 12 additional months of the initial dataset needed to be added and reserved. This approach justified our case where one year (i.e., the third year) was added to the initial training dataset and was further reserved for serving the purposes of the proposed methodology. Different values of *k* were also examined without noticing significant divergences in forecasting, compared to the selected *k* value.

In the next step, the validation procedure (comprising one year of data) was implemented to calculate the in-sample forecasting errors (MSE and MAE) for each ensemble forecasting algorithm (ensemble ANN, ensemble hybrid, ensemble RCGA-FCM, and ensemble SOGA-FCM). The same process was followed for the testing procedure by considering the data of the last year. The two examined ensemble forecasting methods, i.e., the simple average (AVG) and the error-based (EB), were then applied to the calculated validation and testing vectors (Yvd) for each one of the combined forecast methodologies (Yvd-ANN, Yvd-Hybrid, Yvd-RCGA, Yvd-SOGA).
