### *5.2. Evaluation Metrics*

To fairly evaluate the effectiveness of the proposed MCSCNN–LSTM deep model, we adopted multiple evaluation metrics: the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), as shown in Equations (28)–(30), where *N* is the number of testing samples, *forecast* is the forecasted value, and *real* is the ground truth. RMSE evaluates the model by the standard deviation of the residuals between real and forecasted values; MAE is the average absolute distance between ground truth and forecasted values and is more robust to large errors than RMSE. However, both RMSE and MAE are scale-dependent, so they grow significantly when large-magnitude data are used for training and evaluation. Therefore, MAPE, the ratio between residuals and actual values, is also needed as a scale-free metric.

$$RMSE = \sqrt{\frac{\sum\_{n=1}^{N} \left(forecast\_n - real\_n\right)^2}{N}} \tag{28}$$

$$MAE = \frac{\sum\_{n=1}^{N} \left|forecast\_n - real\_n\right|}{N} \tag{29}$$

$$MAPE = \frac{100\%}{N} \sum\_{n=1}^{N} \left| \frac{forecast\_n - real\_n}{real\_n} \right| \tag{30}$$
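
For reference, all three metrics follow directly from Equations (28)–(30). A minimal NumPy sketch (array names are illustrative, not taken from the paper's code):

```python
import numpy as np

def rmse(forecast: np.ndarray, real: np.ndarray) -> float:
    """Root mean square error, Equation (28)."""
    return float(np.sqrt(np.mean((forecast - real) ** 2)))

def mae(forecast: np.ndarray, real: np.ndarray) -> float:
    """Mean absolute error, Equation (29)."""
    return float(np.mean(np.abs(forecast - real)))

def mape(forecast: np.ndarray, real: np.ndarray) -> float:
    """Mean absolute percentage error in percent, Equation (30).
    Assumes the real values are non-zero, which holds for consumption data."""
    return float(100.0 * np.mean(np.abs((forecast - real) / real)))

# Illustrative example: three forecasted vs. actual consumption values.
forecast = np.array([105.0, 98.0, 112.0])
real = np.array([100.0, 100.0, 110.0])
print(rmse(forecast, real), mae(forecast, real), mape(forecast, real))
```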

### *5.3. Performance Comparison with Other Excellent Methods*

We compared our proposed method to other excellent deep learning-based methods: the DNN- [34], NPCNN- [35], LSTM- [20], and CNN–LSTM-based [25] methods. The structure and configuration information of these comparative methods are given in Table 5. Because the above methods employed additional sensor data, we adopted only their structures. Conv1D denotes a 1-D convolutional layer; Max1D denotes a 1-D max-pooling layer. We ran each deep learning-based method 10 times to overcome the impact of randomness, with 50 epochs per run. Furthermore, we found that the NPCNN- [35], LSTM- [20], and CNN–LSTM-based [25] methods did not learn in some runs: the loss did not decrease as training epochs increased but instead stayed at a constant value from the first epoch. In other words, NPCNN, LSTM, and CNN–LSTM rely heavily on initialization. Table 6 lists how many of the 10 training runs failed to learn for each forecast; the term "None" means the method always learns from the raw data and is not sensitive to the random initial settings. For electricity consumption forecasting, such unpredictable and unexpected behavior must be avoided, yet the NPCNN- [35], LSTM- [20], and CNN–LSTM-based [25] methods depend heavily on initialization. The findings show that only DNN [34] and the proposed method always learn, so they are considered stable.
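
The stability check itself is straightforward to reproduce. A hedged sketch, assuming a Keras-style factory `build_model` (hypothetical name) that returns a freshly initialized, compiled model:

```python
import numpy as np

def count_non_learning_runs(build_model, x_train, y_train,
                            runs: int = 10, epochs: int = 50,
                            tol: float = 1e-6) -> int:
    """Count runs whose training loss stays constant from the first epoch,
    i.e., runs in which the model never starts to learn."""
    failures = 0
    for _ in range(runs):
        model = build_model()  # fresh random initialization each run
        history = model.fit(x_train, y_train, epochs=epochs, verbose=0)
        losses = np.asarray(history.history["loss"])
        # A run is "non-learning" if the loss never drops below its
        # first-epoch value by more than the tolerance.
        if losses.min() > losses[0] - tol:
            failures += 1
    return failures
```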


**Table 5.** The structure and configuration information of comparative methods.


**Table 6.** The number of non-learning runs for each deep learning-based method out of 10 training runs.

We compared the proposed method to the stable DNN [34] using metrics averaged over 10 runs, and also compared the averaged metrics of the proposed approach to the best results of the three unstable methods, NPCNN [35], LSTM [20], and CNN–LSTM [25], across their 10 runs, as shown in Table 7 (RMSE), Table 8 (MAE), and Table 9 (MAPE). The findings reveal that our proposed method clearly outperforms DNN on all evaluation metrics for electricity forecasting at every duration. Even when compared to the best results of the other three methods, the proposed MCSCNN–LSTM keeps the highest performance on all metrics for all data sets, except for VSTF on the DAYTON data set under MAPE and for STF on the AEP data set under RMSE, where LSTM [20] performs slightly better. In summary, the proposed MCSCNN–LSTM can forecast electricity consumption over different durations accurately and stably.

The averaged improvements in MAPE on the different data sets are shown in Figure 4. The results show substantial improvement for forecasts at all durations; for STF, MTF, and LTF in particular, the improvement exceeds 50% compared to all the above methods. We selected the stable DNN [34], as listed in Table 7, to compare the predicted results, as shown in Figure 5. The findings show that both the proposed method and DNN [34] can predict the global trend of electricity consumption at VSTF, STF, and MTF; however, DNN cannot predict long-term electricity consumption. Moreover, the proposed method outperforms DNN in that it can predict more detailed irregular trends for VSTF, STF, and MTF, as can be seen in the marked deep-red boxes in the VSTF and STF panels of Figure 5.
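
One plausible way to compute the averaged improvement shown in Figure 4, assuming the relative reduction in MAPE is averaged with equal weight per data set (the numbers below are illustrative, not taken from the tables):

```python
import numpy as np

def averaged_mape_improvement(baseline_mapes, proposed_mapes) -> float:
    """Average relative MAPE improvement (%) over the data sets,
    each data set weighted equally (as stated in the Figure 4 caption)."""
    baseline = np.asarray(baseline_mapes, dtype=float)
    proposed = np.asarray(proposed_mapes, dtype=float)
    per_dataset = (baseline - proposed) / baseline * 100.0
    return float(per_dataset.mean())

# Illustrative values only: a baseline's MAPE vs. the proposed model's MAPE
# on the three data sets for one forecast duration.
print(averaged_mape_improvement([4.2, 5.0, 3.8], [1.9, 2.3, 1.8]))
```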


**Table 7.** The comparison results with RMSE.

**Table 8.** The comparison results with MAE.


**Table 9.** The comparison results with MAPE.


**Figure 4.** The averaged MAPE improvement for each forecast duration compared to other excellent deep learning-based methods on three data sets. Each data set contributes equally to the final result.

**Figure 5.** The comparative forecasting results of two different deep learning-based methods. The *x*-axis is the time stamp at each duration; the *y*-axis is electricity consumption. (**a**) VSTF (hourly) forecasting results. (**b**) STF (daily) forecasting results. (**c**) MTF (weekly) forecasting results. (**d**) LTF (monthly) forecasting results.

### *5.4. Feature Extraction Capacity of MCSCNN–LSTM*

To better understand the feature extraction capacity of the proposed MCSCNN–LSTM, we first compared it, using MAPE, to single models drawn from MCSCNN–LSTM: the Multi-Scale CNN (MSCNN), the Multi-Channel and Multi-Scale CNN (MCSCNN), and the hybrid conventional stacked CNN–LSTM (SCNN–LSTM). The structure of the LSTM is the same as in [20], which is not stable. All configurations of these models are the same as in MCSCNN–LSTM. The results, averaged over 10 runs, are shown in Table 10.


**Table 10.** The comparison results for validating the feature learning capacity using averaged MAPE.

Comparing MSCNN with NPCNN [35], we find that the proposed MSCNN is more stable than general CNN-based methods. Comparing MSCNN with MCSCNN indicates that MSCNN, with its single input, cannot extract satisfactory feature representations for the different electricity consumption forecasts, especially for STF, MTF, and LTF. Comparing SCNN–LSTM to LSTM shows that the CNN front end extracts robust features that avoid instability and improve performance. We computed the averaged improvement of MAPE on the three data sets for each forecast duration, taking MSCNN as the baseline, as shown in Figure 6. The findings reveal that the proposed method brings large improvements for all kinds of forecasts; notably, the only decrease in performance is for MCSCNN on the AEP data set for VSTF. Overall, feature extraction capacity is ranked as: proposed > SCNN–LSTM > MCSCNN > MSCNN.

Secondly, we analyzed the internal features to confirm the productive feature extraction capacity of the proposed deep model. The visualization results for one VSTF sample, shown in Figure 7a,b, illustrate the dual-channel inputs: one raw data sample and its corresponding statistical components, using a normalized data sample from AEP. Figure 7c is the CNN-learned feature map obtained from the raw data sample. Figure 7d is the LSTM feature map, marked with a red box in the comprehensive feature map of Figure 7f; Figure 7e is the statistical feature map, marked with a yellow box in Figure 7f; and the unmarked part of Figure 7f is the CNN-learned feature map. Figure 7f is the comprehensive feature map. The findings reveal that the CNN learns multi-scale robust global features with little noise, as its feature map barely changes around 0. The LSTM feature map ranges from −0.100 to 0.075, while the statistical feature map ranges from 0.00 to 1.75, which indicates that the statistical components are more useful than the LSTM for extracting detailed patterns. The comprehensive feature map combines robust multi-scale global features with detailed features from different domains.
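
Such feature maps can be read out of a trained Keras model with a small extractor sub-model. A sketch of the idea, where the layer name `concat_features` is a hypothetical placeholder for the concatenation layer in MCSCNN–LSTM, not the authors' actual code:

```python
import numpy as np
from tensorflow import keras

def comprehensive_feature_map(model: keras.Model, sample) -> np.ndarray:
    """Read out the concatenated feature vector for one sample and reshape
    it from 1-by-66 to 11-by-6 for plotting (as in Figure 7f).
    `sample` holds both input channels, e.g., [raw_window, stat_components]."""
    extractor = keras.Model(inputs=model.inputs,
                            outputs=model.get_layer("concat_features").output)
    features = extractor.predict(sample)        # shape (1, 66)
    return np.asarray(features).reshape(11, 6)  # 11-by-6 grid for the heatmap
```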

**Figure 6.** The averaged improvement of the proposed deep model for each duration forecast based on MSCNN on three data sets.

**Figure 7.** Feature map visualization. Each part of the proposed deep model extracts different features. (**a**) The raw sample channel. (**b**) The statistical components channel. (**c**) The CNN-learned feature map, which barely changes around 0. (**d**) The LSTM-learned feature map, which ranges from −0.10 to 0.075. (**e**) The statistical components feature map of a reshaped tensor; the raw statistical components channel was reshaped into [1,6] and ranges from 0.00 to 1.75. (**f**) The reshaped comprehensive feature map. The obtained feature has shape 1 by 66 and was reshaped into 11 by 6 for clearer viewing and analysis.
