*3.3. Experimental Results*

This section presents the results obtained by our proposed model on the two case studies presented in Section 2.2. We first evaluated our model in terms of predicting the evolution of the radar data products using a lead time of one time step (5 or 6 min) in the future. Table 2 shows the results obtained for the regression metrics on the two case studies when predicting one time step in the future, where *k* denotes the number of previous time steps used in the prediction (as presented in Section 2.3). The results for the classification evaluation metrics at the considered thresholds are presented in Table 3. The best values are highlighted.
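As an illustration of how such regression scores can be computed, the sketch below implements RMSE and the Pearson correlation coefficient (*CC*), together with non-zero variants (*RMSEnz*, *CCnz*), here assumed to be restricted to pixels with non-zero observed reflectivity. The function name and the exact definition of the "nz" restriction are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def regression_metrics(pred, obs):
    """Illustrative RMSE/CC scores; the 'nz' variants are assumed to be
    restricted to pixels where the observed reflectivity is non-zero."""
    pred = np.asarray(pred, dtype=float).ravel()
    obs = np.asarray(obs, dtype=float).ravel()
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    cc = np.corrcoef(pred, obs)[0, 1]          # Pearson correlation
    nz = obs > 0                               # non-zero observation mask
    rmse_nz = np.sqrt(np.mean((pred[nz] - obs[nz]) ** 2))
    cc_nz = np.corrcoef(pred[nz], obs[nz])[0, 1]
    return {"RMSE": rmse, "RMSEnz": rmse_nz, "CC": cc, "CCnz": cc_nz}
```

Lower is better for the RMSE-based scores, while *CC* and *CCnz* are better when closer to 1.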

By analyzing the performance metrics shown in Table 3, we observed that the performance decreased as the threshold value increased, which was to be expected since high reflectivity values were scarce in the dataset and thus challenging to predict accurately. For both the NMA and MET datasets, the results in Tables 2 and 3 revealed that the average performance for predicting one time step in the future using four previous time steps (*k* = 4) was better than that using only one previous time step (*k* = 1) in terms of the *FAR*, *CC* and *CCnz* performance metrics. We noted that for the NMA dataset, the *FAR* value was better for *k* = 1 at high values of the threshold *τ* (i.e., 20 and 30). For the other performance metrics (*RMSE*, *RMSEnz*, *CSI*, *POD* and *BIAS*), the performance was better for *k* = 1. This suggested that when the *NeXtNow* model used four previous time steps, it produced fewer false alarms; however, it also forecasted fewer events than when it used only one previous time step.
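The classification scores discussed above (*CSI*, *POD*, *FAR* and *BIAS*) follow from a standard contingency table once a reflectivity threshold *τ* is fixed: a pixel counts as an observed (or predicted) event when its value exceeds *τ*. A minimal sketch, with an illustrative function name rather than the authors' code, is:

```python
import numpy as np

def classification_metrics(pred, obs, tau):
    """Categorical verification scores at reflectivity threshold tau.
    Degenerate cases (e.g., no observed events) are not handled here."""
    pred_ev = np.asarray(pred) > tau
    obs_ev = np.asarray(obs) > tau
    hits = np.sum(pred_ev & obs_ev)            # event predicted and observed
    misses = np.sum(~pred_ev & obs_ev)         # event observed, not predicted
    false_alarms = np.sum(pred_ev & ~obs_ev)   # event predicted, not observed
    return {
        "CSI": hits / (hits + misses + false_alarms),
        "POD": hits / (hits + misses),
        "FAR": false_alarms / (hits + false_alarms),
        "BIAS": (hits + false_alarms) / (hits + misses),
    }
```

*CSI* and *POD* are better when closer to 1, *FAR* when closer to 0, and *BIAS* when closer to 1 (values below 1 indicate under-forecasting of events, consistent with the *k* = 4 behavior noted above).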

The values for the performance metrics (Tables 2 and 3) that were obtained by our *NeXtNow* model using one previous time step (*k* = 1) for both the NMA and MET datasets were compared against those obtained using four previous time steps (*k* = 4) via a two-tailed paired Wilcoxon signed-rank test [43,44]. A *p*-value of less than 0.00001 was obtained, which indicated that the differences between the *NeXtNow* performances when using one and four previous time steps were statistically significant at a significance level of *α* = 0.01. Thus, the results revealed that using multiple time steps did not improve the performance of predictions for one time step in the future, which corresponded to a lead time of 6 min for the NMA case study and 5 min for the MET case study.
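Such a comparison can be carried out with `scipy.stats.wilcoxon`, which performs the paired signed-rank test (two-tailed by default). The scores below are placeholder values chosen purely to demonstrate the call, not the paper's actual measurements.

```python
from scipy.stats import wilcoxon

# Hypothetical paired metric values for k = 1 vs. k = 4 (placeholder data).
scores_k1 = [0.60, 0.58, 0.61, 0.59, 0.63, 0.62, 0.57, 0.64]
scores_k4 = [0.66, 0.65, 0.69, 0.64, 0.72, 0.66, 0.67, 0.67]

# Paired two-tailed Wilcoxon signed-rank test on the per-pair differences.
stat, p_value = wilcoxon(scores_k1, scores_k4)
significant = p_value < 0.01  # significant at alpha = 0.01
```

The test is non-parametric, making it a common choice when the per-run metric differences cannot be assumed to be normally distributed.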

While this result might seem counter-intuitive, it was not completely unexpected for the NMA experiments. In our previous work on radar data from the NMA [45], we used unsupervised neural network techniques (self-organizing maps) to mine relevant patterns from the data and empirically showed that when predicting the value of the next step at a location, there were no significant differences between the patterns mined from one previous time step and those mined from five previous time steps. In other words, since the patterns were similar, adding more time steps contributed little additional information. We hypothesized that this occurred because we were using both reflectivity and velocity: the velocity product allowed the network to infer the trajectory of the meteorological event from a single time step, so multiple time steps added little information.

**Table 2.** The results for the regression evaluation metrics for predicting one time step in the future, where *k* denotes the number of previous time steps that were used in the predictions. The means and standard deviations that were computed across the three experimental runs are shown. The best values for the performance metrics are marked with bold and colored with yellow (for the NMA case study) and with blue (for the MET case study).


**Table 3.** The results for the classification evaluation metrics for predicting one time step in the future. The means and standard deviations that were computed across the three experimental runs are shown. The best values for the performance metrics are marked with bold and colored with yellow (for the NMA case study) and with blue (for the MET case study).


In light of these new results, it might simply be that because meteorological data change very slowly from one time step to the next [46], the trajectory of the meteorological event was not particularly relevant when predicting only one time step (5 to 6 min) in the future. This would explain why using multiple previous time steps did not improve the results for the MET experiments either, even though the velocity product was not used there. It also suggests that using multiple previous time steps, which would allow the network to learn to compute the trajectory of an event, could be more useful when predicting further than 5 min in the future.

In other words, in the absence of radar products that relate to motion, we hypothesized that including multiple radar measurements could encapsulate information regarding the direction of movement and hence improve the predictive performance for *larger lead times* than just 5 min. To validate this hypothesis for the MET case study, we further evaluated the predictive performance of our model for forecasts 15 min in the future. The model was trained in a similar manner to before, with the difference that it was optimized to predict the radar reflectivity values at a time point 15 min in the future using a series of *k* consecutive time steps. As previously, we performed the experiments for *k* ∈ {1, 4}. The results for the regression and classification metrics are presented in Tables 4 and 5, respectively. The best obtained results are highlighted for each regression metric (Table 4) and for each classification metric at each value of the threshold *τ* (Table 5).
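The training setup described above can be sketched as a sliding-window construction over the radar sequence: each sample pairs *k* consecutive input frames with the frame a given number of steps after the last input. The function name and interface are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def make_samples(frames, k, lead):
    """Build (input, target) pairs from a radar frame sequence.
    Each input is k consecutive frames; the target is the frame
    `lead` time steps after the last input frame."""
    frames = np.asarray(frames)
    inputs, targets = [], []
    for t in range(len(frames) - k - lead + 1):
        inputs.append(frames[t:t + k])            # k consecutive past frames
        targets.append(frames[t + k + lead - 1])  # frame `lead` steps ahead
    return np.stack(inputs), np.stack(targets)
```

For the MET case study, one time step corresponds to 5 min, so a 15-min lead time corresponds to `lead = 3`; with `k = 4` and a sequence of 10 frames, this yields 10 − 4 − 3 + 1 = 4 training samples.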

By analyzing the results for both the regression and classification metrics in Tables 4 and 5, we observed that the results obtained for *k* = 4 time steps were better than those obtained using only one past time step, which confirmed our hypothesis. Only for the threshold value of 30 were the values for *FAR* and *BIAS* slightly better for *k* = 1 than for *k* = 4, but this might be due to the data imbalance (reflectivity values higher than *τ* = 30 were scarce in the dataset). By comparing the regression metrics in Tables 2 and 4 and the classification metrics in Tables 3 and 5, we also observed that our model obtained better results for a 5-min lead time than for a 15-min lead time, which could be explained by the fact that forecasting becomes more challenging for lead times that are further in the future.

Figure 9 illustrates some sample predictions from our *NeXtNow* model, which was trained using the NMA dataset. The first column depicts the inputs, the second column presents the predictions and the last column shows the actual radar observations. Each row in the figure shows a different product (R01 to R04).

**Table 4.** The results for the regression evaluation metrics for predicting three time steps in the future using the MET case study, where *k* denotes the number of previous time steps that were used in the predictions. The means and standard deviations that were computed across the three experimental runs are shown. The best values for the performance metrics are marked with bold.


**Table 5.** The results for the classification evaluation metrics for predicting three time steps in the future using the MET case study. The means and standard deviations that were computed across the three experimental runs are shown. The best values for the performance metrics are marked with bold.


**Figure 9.** Sample predictions from our model that was trained using the NMA dataset. The first column depicts the inputs, the second column presents the predictions and the last column shows the ground truth observations. Each row shows a different product (R01 to R04). The illustrated observations and predictions correspond to radar measurements that were gathered in an area of approximately 250 × 250 km.

Figures 10 and 11 show sample predictions that were obtained using our model that was trained using the MET case study for lead times of 5 min and 15 min, respectively. In both figures, the first four columns show the inputs, the fifth column depicts the predictions and the last column shows the actual radar observations. As can be observed from the figures, the predictions that were produced by the model were smoother than the actual observations.

The experimental results presented above revealed a decrease in *NeXtNow*'s performance at higher reflectivity values, which was likely due to the imbalance between the amounts of smaller and higher reflectivity values in the datasets. Extending the dataset to include a much larger number of convective events could improve the predictions at higher reflectivity values. Nevertheless, for storm-based nowcasting, as in the NMA case study, the prediction of reflectivity spatial patterns is equally important for assessing the evolution of convective storms. Our future work is envisaged to address this challenge.

By comparing the predictions of *NeXtNow* for one time step in the future using the NMA (Figure 9) and MET (Figure 10) datasets, we could observe better predictions at high reflectivity values using the NMA dataset. This improvement in *NeXtNow*'s prediction performance at higher reflectivity values could be due to the velocity field introducing supplementary information about convection.


**Figure 10.** Sample predictions from our model that was trained using the MET dataset for a 5-min lead time. The first four columns show the inputs, the fifth column depicts the predictions and the last column shows the observations. The illustrated observations and predictions correspond to radar measurements that were gathered in an area of approximately 250 × 250 km.

**Figure 11.** Sample predictions from our model that was trained using the MET dataset for a 15-min lead time. The first four columns show the inputs, the fifth column depicts the predictions and the last column shows the observations. The illustrated observations and predictions correspond to radar measurements that were gathered in an area of approximately 250 × 250 km.
