This section presents the results obtained by the ML regression methods described above on several prediction problems with human activity patterns. For each dataset, the performance of the ML algorithms is evaluated first, and the effect of including meteorological variables is then assessed.
4.2. Results
First, the results of the MLP, ELM, SVR, and ANA in the school absences dataset are reported in Table 6.
As can be seen, the ANA baseline approach performs well on this dataset, indicating that similar days in the past had similar school absence figures. Among the ML approaches, the SVR performs best and outperforms the ANA approach, with the best MAE of
. The effect of including meteorology in the prediction is positive: all methods improve their results when meteorology is considered, especially in terms of the RMSE, which indicates that bad weather influences school absences. ELM and MLP obtain the worst results, even worse than ANA, which is considered the baseline. This result suggests that the input variables used for the prediction are not sufficient for these methods to explain all the variability in the absence problem. Absences may be caused by health problems, such as flu, which may themselves be seasonal. The last column in the table shows the training time (tt, in seconds) for the different regressors considered. The ANA approach is the fastest overall, and the SVR has the highest training time among all the approaches evaluated. Note that the ELM is the fastest of the ML approaches.
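The observation that similar past days had similar absence figures is the idea that an analogue-type baseline exploits. As a rough illustration only, and not the ANA method as defined earlier in the paper, the Python sketch below predicts each test day as the mean target of the training days that share the same day type. The day-type keys, the function names, and the fallback to the global training mean are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def fit_analogue(train_keys, train_targets):
    """Store the mean target per day type (hypothetical analogue baseline)."""
    groups = defaultdict(list)
    for key, target in zip(train_keys, train_targets):
        groups[key].append(target)
    table = {key: mean(values) for key, values in groups.items()}
    # Assumption: unseen day types fall back to the global training mean.
    fallback = mean(train_targets)
    return table, fallback


def predict_analogue(model, keys):
    """Predict each day as the mean of its analogues in the training set."""
    table, fallback = model
    return [table.get(key, fallback) for key in keys]


# Toy data, invented for illustration (not taken from the paper).
train_keys = ["Mon", "Mon", "Tue"]
train_targets = [10, 14, 20]
model = fit_analogue(train_keys, train_targets)
predictions = predict_analogue(model, ["Mon", "Tue", "Sun"])
print(predictions)
```

In this toy case, "Sun" has no analogue in the training set, so its prediction is the global mean rather than a day-type mean.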
Figure 4 shows the performance of the different regressors considered in the test set of the school absences dataset, including meteorology input variables.
The second dataset with a human activity pattern on which we study the performance of the ML regression methods is the bike-sharing demand data for the city of Madrid (BiciMad).
Table 7 shows the numerical performance of the ML regression techniques in comparison with the ANA approach.
As can be seen, the best result in this dataset is obtained by the MLP algorithm, closely followed by the ELM, with RMSEs of 1428 and 1506, respectively. The SVR performs worse on this problem, with an RMSE of 1552, and the ANA algorithm does not perform well in this case, with an RMSE of 1804. Note that R reaches 83% for the MLP, which indicates that the prediction algorithms perform much better on this dataset than on the previous one. Including the meteorology-related input variables also has a positive impact on the prediction capacity of the ML approaches in this dataset, improving their performance in all cases. This is expected, since meteorological information is key to the number of rented bicycles (demand) in Madrid. The input variables chosen are therefore suitable for this specific problem and characterize bike demand well.
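To put the RMSE values quoted above on a common scale, each ML method's RMSE can be expressed as a relative reduction with respect to the ANA baseline. The short Python sketch below performs this arithmetic using only the four RMSE figures given in the text; the helper name `relative_reduction` is hypothetical.

```python
# RMSE values as quoted in the text for the BiciMad dataset.
rmse = {"MLP": 1428, "ELM": 1506, "SVR": 1552, "ANA": 1804}


def relative_reduction(value, baseline):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Reduction of each ML method's RMSE relative to the ANA baseline.
reductions = {
    name: round(relative_reduction(value, rmse["ANA"]), 1)
    for name, value in rmse.items()
    if name != "ANA"
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower RMSE than ANA")
```

With these figures, the MLP reduces the RMSE by roughly 21% relative to ANA, the ELM by about 17%, and the SVR by about 14%.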
The visual performance of the different algorithms can be seen in Figure 5 (regression approaches considering meteorology input variables). As can be seen, the accuracy of the prediction in this dataset is much better than in the previous case of the school absences dataset.
Except for the ANA method, all methods predict the bike demand time series well.
We also evaluated the performance of the ML algorithms in predicting parking occupancy in San Sebastian, northern Spain.
Table 8 shows the numerical performance of the ML algorithms in terms of the different quality metrics considered.
In this case, the best ML approach in terms of the MAE is the ELM with
. However, in terms of the RMSE and
, the MLP reaches the best results with
, respectively. Both the MLP and ELM perform very well on this dataset, as on the previous one. The fact that the ELM ranks best in the MAE but not in the RMSE suggests that it makes larger errors on a few samples. The SVR obtains worse results, and the ANA approach is not competitive on this problem either. As in the previous cases, including meteorological inputs improves the performance of all the ML regression techniques considered. The input variables involved therefore appear suitable for predicting parking occupancy.
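The reasoning behind this MAE/RMSE comparison can be made concrete with a small example. Because the RMSE squares each error before averaging, a model with one large misprediction can have a lower MAE but a higher RMSE than a model with uniformly moderate errors. The Python sketch below uses invented toy predictions (not data from the paper) to show this.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; squaring weights large errors more heavily."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Toy values, invented for illustration only.
y_true = [100, 100, 100, 100]
pred_uniform = [110, 90, 110, 90]    # every error has magnitude 10
pred_outlier = [100, 100, 100, 132]  # three exact hits, one error of 32

print(mae(y_true, pred_uniform), rmse(y_true, pred_uniform))
print(mae(y_true, pred_outlier), rmse(y_true, pred_outlier))
```

Here the second model has the lower MAE (8 versus 10) but the higher RMSE (16 versus 10), which is the same pattern as a method that is best in the MAE but not in the RMSE.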
Figure 6 visually depicts the performance of all regressors tested in this problem. It is possible to see that the ANA approach fails to produce a correct prediction in the second part of the time series, whereas the ML approaches work consistently better in all cases.
Finally, we addressed the prediction of the number of packets delivered at a post office in Azuqueca de Henares, Spain.
Table 9 shows the performance of the different regressors considered in this problem.
In this case, the best algorithm seems to be the MLP, which obtains an MAE of
R , closely followed by the SVR with
and 51. However, the MLP obtains a better RMSE than the ELM,
, which means that it is less sensitive to large mispredictions. The ELM is not well suited to predicting packet deliveries and reports worse results than its ML counterparts. The ANA obtains very poor results, which in this case become even worse when meteorological variables are added. In this problem, including meteorological variables as inputs slightly improves the performance of the ML algorithms, but less than in the previous cases.
Figure 7 visually shows the performance of the different regressors considered in this problem. Note that the number of packets delivered at Azuqueca grows exponentially in the last part of the test set (pandemic months were eliminated from the test set since they were not significant). This exponential growth is, in fact, a collateral effect of the COVID-19 pandemic, during which packet deliveries boomed because of e-commerce growth. In this part of the series, all the regressors perform poorly, though the ELM seems to be the best algorithm there.
Finally, we analyzed the effect of having similar days in the past (training set) on the performance of the ML methods considered. We carried out this analysis on the school absences dataset, for which we calculated how many days similar to the current one (
n) there are in the training set (
). We depict this time series together with the prediction of the different ML regression techniques in
Figure 8a–c. In fact, we depict
(instead of
) as a black line in the figures, for a better match with the predictions of the ML algorithms. As can be seen, some day types in the test set have no similar counterpart in the training set, so the ML algorithms have no similar past situations to learn from. The largest prediction errors of all the ML algorithms considered occur on these days without similarity in the training set. This means that having enough information on similar day types in the past (training set) is key to obtaining good performance from the ML algorithms in these prediction problems with human activity patterns. Note that this behavior is somewhat related to the concept of the
persistence of the system, a concept related in turn to the memory of the system. More specifically, persistence is an important characteristic of many complex systems in nature, related to how long the system remains in a certain state before changing to a different one [41]. In the case of prediction problems with human activity patterns, persistence is related to the information available in the training set to predict a given unseen sample in the test set. In other words, the different ML approaches can obtain good predictions when they are trained with similar cases, but if the system lacks a given persistence level (i.e., there are no similar cases in the past), the ML training quality degrades, as shown in this final experiment for the school absences prediction problem.
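The similar-day analysis described above can be sketched in a few lines of Python. The text of this section does not state the similarity criterion, so the sketch assumes, purely for illustration, that two days are similar when they share the same weekday and holiday flag. The field names `weekday` and `holiday` and both function names are hypothetical.

```python
from collections import Counter


def day_type(day):
    """Hypothetical similarity key: same weekday and same holiday flag."""
    return (day["weekday"], day["holiday"])


def similar_day_counts(train_days, test_days):
    """For each test day, count the training days of the same day type."""
    counts = Counter(day_type(d) for d in train_days)
    # A Counter returns 0 for day types never seen in the training set.
    return [counts[day_type(d)] for d in test_days]


# Toy calendar, invented for illustration (not data from the paper).
train_days = [
    {"weekday": 0, "holiday": False},
    {"weekday": 0, "holiday": False},
    {"weekday": 1, "holiday": False},
]
test_days = [
    {"weekday": 0, "holiday": False},
    {"weekday": 1, "holiday": True},  # no counterpart in the training set
]

n_similar = similar_day_counts(train_days, test_days)
print(n_similar)
```

Test days whose count is zero are the ones with no counterpart in the training set, which is where the analysis above locates the largest prediction errors.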