4.1.7. Comparison of the Different Methods

Following the hyperparameter tuning process, the best-performing model from each algorithm was compared, and the results obtained are presented in Table 8. To this effect, the following metrics were compared across the different models: CC, RAE, RRSE, MAE, and RMSE. Our findings indicate that, although some algorithms appeared to perform better than others, the performance gap was only marginal. For example, the ANN achieved only a 1.899% lower error rate than the GR model in terms of RAE, and the RRSE values of the SVM and RF algorithms differed by about 3.553%. This suggests that, following a proper hyperparameter tuning process, the differences between the models are insignificant.

**Table 8.** Comparison of the different methods based on their best parameter settings.


No model performed best across all the different metrics, thus emphasizing the need to avoid comparing different ML models using only a single metric. For example, although the ANN performed best considering the RAE, it generated the largest RRSE values compared to the other models. Since these different metrics tell different stories, it is essential to conduct the analysis across each metric rather than rely on a single one. To this effect, the higher RRSE value suggests that the ANN model may have been affected by more outliers than the other methods. This observation is again supported by examining the MAE against the RMSE in Table 8, which shows that the ANN produced a higher RMSE than all other methods except the SVM.
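As a minimal sketch, the five comparison metrics above can be computed as follows. The RAE and RRSE are assumed here to be normalised by a naive mean predictor and expressed as percentages, consistent with the percentage differences quoted in the text; the function name is illustrative, not from the original study.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute CC, RAE, RRSE, MAE, and RMSE for one model's predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # Relative errors are normalised by a naive predictor that always
    # outputs the mean of the target values, and reported as percentages.
    naive = y_true - y_true.mean()
    rae = 100.0 * np.sum(np.abs(err)) / np.sum(np.abs(naive))
    rrse = 100.0 * np.sqrt(np.sum(err ** 2) / np.sum(naive ** 2))
    cc = np.corrcoef(y_true, y_pred)[0, 1]  # Pearson correlation coefficient
    return {"CC": cc, "RAE": rae, "RRSE": rrse, "MAE": mae, "RMSE": rmse}
```

Because MAE weights all errors equally while RMSE squares them, a model with a few large outlier errors shows a larger RMSE-to-MAE gap, which is the diagnostic used above for the ANN.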

In addition, we examined the CC values of the different models, with the results of the correlation matrix presented in Figure 3. By comparing the CC achieved by the different models against the target demand, we observed that a *CC* > 0.9 was obtained across all models. This implies a strong positive correlation between the predicted and the target demand values. We can also observe that a *CC* ≈ 1 was obtained between the different models, further emphasizing that the different models all predicted very similar values. In particular, the ANN, GR, KNN, and RF models performed equally well, with little to distinguish them.


**Figure 3.** The correlation matrix of the different methods for the system hourly demand dataset.
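A correlation matrix such as Figure 3 can be sketched as below. The predictions here are synthetic stand-ins (target plus small noise per model) since the actual model outputs are not reproduced in this section; the column names and noise levels are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the tuned models' predictions on a common test
# set; in practice these columns would hold the actual model outputs.
rng = np.random.default_rng(0)
target = rng.uniform(2000.0, 5000.0, size=200)  # hourly demand values
preds = pd.DataFrame({
    "Target": target,
    "ANN": target + rng.normal(0.0, 50.0, 200),
    "GR":  target + rng.normal(0.0, 55.0, 200),
    "KNN": target + rng.normal(0.0, 60.0, 200),
    "RF":  target + rng.normal(0.0, 60.0, 200),
    "SVM": target + rng.normal(0.0, 80.0, 200),
})

# Pairwise Pearson CC between every pair of columns, as in Figure 3
corr = preds.corr()
print(corr.round(3))
```

When every model tracks the target closely, all off-diagonal entries approach 1, which is exactly the pattern reported for the ANN, GR, KNN, and RF models.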

Finally, in quantitative terms, a Tukey comparative test of the different models was performed, and the results are documented in Table 9. The Tukey test is a multiple-comparison procedure used to identify means that are significantly different from each other. The aim of using this test is to determine quantitatively whether a significant difference exists between the mean results obtained across the different models. Further details regarding the Tukey test can be found in [46]. The symbols used to interpret the range of *p*-values, *p*, obtained for all the Tukey tests reported in this article are provided in Table 10. An examination of Table 9 indicates that there was no significant (ns) difference between the target and predicted data of the different models (see column 5 of Table 9). It also confirms that there was no significant difference between any of the other methods, with *p*-values all greater than 0.997 on average. These results support the correlation findings of Figure 3, further emphasizing that, following a proper hyperparameter tuning exercise, the different algorithms all perform, on average, the same, with little or no significant difference between them.
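A pairwise Tukey comparison of this kind can be sketched with `scipy.stats.tukey_hsd` (SciPy ≥ 1.8). The group data below are synthetic stand-ins for the target and the models' predictions, used only to illustrate the procedure; the actual values underlying Table 9 are not reproduced here.

```python
import numpy as np
from scipy.stats import tukey_hsd

# Synthetic stand-ins: each "model" predicts the target plus small noise,
# so all group means are nearly identical (as in Table 9).
rng = np.random.default_rng(1)
target = rng.uniform(2000.0, 5000.0, size=100)
groups = {
    "Target": target,
    "ANN": target + rng.normal(0.0, 50.0, 100),
    "GR":  target + rng.normal(0.0, 55.0, 100),
    "KNN": target + rng.normal(0.0, 60.0, 100),
    "RF":  target + rng.normal(0.0, 60.0, 100),
}

# Tukey's HSD test over all pairwise combinations of the groups
res = tukey_hsd(*groups.values())
# res.pvalue[i, j] is the p-value for the comparison of groups i and j;
# values near 1 indicate no significant difference in means (ns).
print(res.pvalue.round(3))
```

Large *p*-values across every pair correspond to the "ns" entries of Table 9: the null hypothesis of equal means cannot be rejected for any pair of models.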
