2.3.1. ResNeXt Architecture

The ResNeXt architecture was proposed by Xie et al. [17] as an improved version of the *ResNet* model [35]. The *ResNet* architecture [35] addresses the difficulty in training very deep neural networks by introducing shortcut connections, a technique in which the input of an architectural block is added to its output in order to obtain the final output. By passing information from earlier layers to deeper layers, the network can instead learn residual mappings, thus making it possible to efficiently train very deep architectures. This process can be viewed as a form of feature fusion, in which features at different levels of depth are combined using an addition operation [36]. Multiple residual blocks are stacked to form deep networks [35].

ResNeXt further builds on this architectural blueprint by using *grouped convolutions* instead of plain convolutions inside the residual blocks. Grouped convolutions are a type of convolutional layer in which the input is split channel-wise into multiple groups, each group is processed individually by convolutions and the results are concatenated to obtain the final output. This construction has been shown to be equivalent to applying a set of aggregated transformations, which can be formalized as follows.

Given an input $\mathbf{x}$ and a hyperparameter called the *cardinality* $C$, an aggregated transformation can be obtained from a set of transformations $\{\tau_1, \ldots, \tau_C\}$ as:

$$\mathcal{F}(\mathbf{x}) = \sum\_{i=1}^{C} \tau\_i(\mathbf{x})$$

Following the strategy in *ResNet*, the aggregated transformation is wrapped in a residual connection, which leads to the following computation for the output:

$$\mathbf{y} = \mathbf{x} + \sum\_{i=1}^{C} \tau\_i(\mathbf{x})$$
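To make the equivalence concrete, the following minimal PyTorch sketch (not taken from the paper; all tensor sizes are illustrative) verifies that a grouped convolution with cardinality $C = 4$ produces the same output as splitting the input channel-wise, convolving each group independently and concatenating the results:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 8, 16, 16)  # 1 image, 8 channels, 16x16 pixels

# A grouped convolution with cardinality C = 4 (two channels per group).
grouped = nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=4, bias=False)

# The equivalent "split -> convolve -> concatenate" computation.
x_groups = torch.split(x, 2, dim=1)               # 4 groups of 2 channels
w_groups = torch.split(grouped.weight, 2, dim=0)  # 2 output channels each
parts = [F.conv2d(xg, wg, padding=1) for xg, wg in zip(x_groups, w_groups)]
y_manual = torch.cat(parts, dim=1)

assert torch.allclose(grouped(x), y_manual, atol=1e-6)
```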

Figure 1 shows a schematic representation of the two types of blocks that are used in the *ResNet* and ResNeXt architectures. It has been shown experimentally that tuning the hyperparameter $C$ can lead to significant performance improvements in image classification tasks [17].

As in the case of *ResNet*, the ResNeXt architecture is composed of a succession of blocks [17].
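For illustration, a ResNeXt bottleneck block of the kind described above can be sketched in PyTorch as follows; the channel sizes and the cardinality are illustrative defaults rather than the exact values used in *NeXtNow*:

```python
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """A minimal sketch of a ResNeXt bottleneck block, following the
    aggregated-transformation formulation above; channel sizes and
    cardinality are illustrative, not taken from the paper."""

    def __init__(self, channels: int, cardinality: int = 32, bottleneck: int = 128):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            # The grouped 3x3 convolution realizes the C parallel paths.
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y = x + sum_i tau_i(x): shortcut plus aggregated transformation.
        return self.relu(x + self.transform(x))
```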

**Figure 1.** The *ResNet* (**left**) and *ResNeXt* (**right**) blocks. The kernel size in each convolution is shown in the figure.

2.3.2. Formalization, Data Modeling and Preprocessing

We denoted the radar products of interest by $P = \{r_1, r_2, \ldots, r_n\}$, where $n$ is the dimensionality of $P$ (the number of radar products that were used). For the case studies that were described in Section 2.2, we obtained the following values for $P$ and $n$:


The radar data that were input at a certain time moment $t$ were denoted by $I_t$ and were modeled as 3D images with $n$ channels (corresponding to the available radar products), with the $i$-th channel representing the values of the radar product $r_i$ at time $t$. More specifically, the OX and OY axes represented the longitudinal and latitudinal coordinates of the geographical area and the OZ axis represented the channels (i.e., the values of the radar products $P$ at time moment $t$).

A sample 4-channel 3D image (with $n = 4$ products) is shown in Figure 2.

Given a certain step $k$, the goal of our learning problem was to predict the 3D image at time moment $t$ from the 3D images that were collected at the time moments $t-k, t-k+1, \ldots, t-1$. In our model, the output was also an $n$-channel 3D image, in which the value of a point on the $i$-th channel of the image $I_t$ was the value that was predicted for the radar product $r_i$ at time $t$. We noted that one time step (i.e., the time period between two consecutive time moments $t-1$ and $t$) represented the time resolution between two consecutive radar scans. More specifically, a time step was 6 min for the NMA case study and 5 min for the MET case study.

We denoted the sequence of 3D images that represented the radar data collected at time moments $t-k, t-k+1, \ldots, t-1$ by $Seq(t, k) = \langle I_{t-k}, I_{t-k+1}, \ldots, I_{t-1} \rangle$. In this context, the target function of our learning problem was a function $M$ that mapped the $k$-length sequence of $n$-channel 3D images $Seq(t, k)$ onto another $n$-channel 3D image $I_t$, i.e., $I_t = M(Seq(t, k))$. The *NeXtNow* deep learning model learned a hypothesis $h$ that approximated $M$ ($h \approx M$), i.e., $h(I_{t-k}, I_{t-k+1}, \ldots, I_{t-1}) \approx I_t \; \forall t$. Thus, for a sequence of images $Seq(t, k)$, *NeXtNow* provided a multi-channel 3D image $I_t$ that contained the estimated values of the radar products at time $t$.
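As an illustration of this data model, the following NumPy sketch assembles the $(Seq(t, k), I_t)$ pairs from a chronological stack of radar images; the $(T, n, H, W)$ array layout and the helper name are assumptions made for this example:

```python
import numpy as np

def build_samples(images: np.ndarray, k: int):
    """Assemble (Seq(t, k), I_t) training pairs from a chronological stack
    of n-channel radar images of shape (T, n, H, W); this layout is an
    assumption for illustration, not prescribed by the paper."""
    inputs, targets = [], []
    for t in range(k, images.shape[0]):
        inputs.append(images[t - k:t])  # Seq(t, k) = <I_{t-k}, ..., I_{t-1}>
        targets.append(images[t])       # ground truth I_t
    # Each input has shape (k, n, H, W); the k time steps can be
    # concatenated along the channel axis before being fed to the model.
    return np.stack(inputs), np.stack(targets)
```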

**Figure 2.** A sample 4-channel 3D image for $I_t$.

A sequence of 3D images containing radar data collected at different time moments $t$ was available. From this sequence, a dataset $D$ was created whose instances were of the form $Seq(t, k) = \langle I_{t-k}, I_{t-k+1}, \ldots, I_{t-1} \rangle$, i.e., sequences of $n$-channel 3D images representing radar data collected at time moments $t-k, t-k+1, \ldots, t-1$. For each instance $Seq(t, k)$ in $D$, the ground truth (i.e., the $n$-channel 3D image $I_t$ that contained the values of the radar products at time $t$) was available and was used to train the model.

Before building the *NeXtNow* deep learning model, a preprocessing step was applied to the 3D images $I_t$ to correct any possible errors that existed in the radar data. For the NMA dataset, two different preprocessing methods were used, depending on the product. For R, the only preprocessing that was carried out was to replace the "No Data" (NaN) values with 0. For the V product, a more complex preprocessing step was required. The issue with V was that it was a very noisy product because it represented the velocity *relative to the radar*, so there were cases in which the radar could not properly estimate the direction or the speed, thus producing invalid values. These invalid values appeared often enough that they could interfere with the model learning [37]. We addressed this problem by introducing a cleaning step, which replaced the invalid values with valid ones. The new values were computed as the weighted average of the valid values in the neighborhood surrounding the invalid value, with the weight of each neighboring value being inversely proportional to its difference from the invalid value.
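The cleaning step described above can be sketched as follows; the neighborhood radius, the smoothing constant `eps` and the behavior when no valid neighbors exist are assumptions, as the paper does not specify them:

```python
import numpy as np

def clean_invalid(v: np.ndarray, invalid_mask: np.ndarray,
                  radius: int = 1, eps: float = 1e-6) -> np.ndarray:
    """Replace each invalid velocity value with a weighted average of the
    valid values in its neighborhood, with weights inversely proportional
    to the difference from the invalid value (a sketch, not the paper's
    exact implementation)."""
    cleaned = v.copy()
    rows, cols = np.where(invalid_mask)
    for r, c in zip(rows, cols):
        r0, r1 = max(r - radius, 0), min(r + radius + 1, v.shape[0])
        c0, c1 = max(c - radius, 0), min(c + radius + 1, v.shape[1])
        window = v[r0:r1, c0:c1]
        valid = ~invalid_mask[r0:r1, c0:c1]
        if valid.any():  # leave the value unchanged if no valid neighbor
            weights = 1.0 / (np.abs(window[valid] - v[r, c]) + eps)
            cleaned[r, c] = np.average(window[valid], weights=weights)
    return cleaned
```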

The raw MET data were preprocessed as follows. Since the raw data contained negative reflectivity values, which were not important for nowcasting, these values were all replaced with a constant value of −1. Additionally, the NaN values that corresponded to missing radar measurements were replaced with a value outside the domain of valid reflectivity values, i.e., −5, in order to distinguish them from both the negative values and the reflectivity values of interest (i.e., the positive values).

In addition to the previous preprocessing steps, the data were normalized using the classical *min*-*max* normalization method. For the *min*-*max* normalization, we used the minimum and maximum values from the domain of the radar products instead of the minimum and maximum values from the training dataset. This way, we made sure that the same values in different datasets were assigned the same normalized values. In the case of the MET dataset, the minimum value used for normalization was −5, which corresponded to the missing radar measurements.
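The following NumPy sketch summarizes the MET preprocessing and the fixed-domain *min*-*max* normalization; the upper bound of the reflectivity domain (`z_max`) is an assumed value for illustration:

```python
import numpy as np

def preprocess_met(z: np.ndarray, z_min: float = -5.0, z_max: float = 75.0):
    """A sketch of the MET preprocessing and min-max normalization steps
    described above; z_max is an assumption, not a value from the paper."""
    z = z.copy()
    nan_mask = np.isnan(z)
    z[~nan_mask & (z < 0)] = -1.0  # negative reflectivity -> constant -1
    z[nan_mask] = -5.0             # missing measurements -> out-of-domain -5
    # Min-max normalization using the fixed domain bounds of the product
    # rather than the training-set minimum and maximum.
    return (z - z_min) / (z_max - z_min)
```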

2.3.3. Building the *NeXtNow* Model

The predictive model *NeXtNow* was built using a training dataset that consisted of samples of the form $(Seq(t, k), I_t)$, where $I_t = M(Seq(t, k))$ represented the ground truth (the 3D image consisting of the real values of the radar products at time $t$) for the training instance $Seq(t, k) = \langle I_{t-k}, I_{t-k+1}, \ldots, I_{t-1} \rangle$.

The proposed model had a fully convolutional encoder–decoder architecture, which was formed of three main components. The first component was an encoder, which was inspired by the ResNeXt architecture.

The encoder consisted of two classical convolutions, which had the role of providing multiple feature maps for the inputs, followed by three ResNeXt blocks. The blocks were constructed according to the original ResNeXt paper [17], as presented in Section 2.3.1. The final convolution in each block multiplied the number of filters by four, while the grouped convolution downsampled the input image by a factor of two. Each convolution in the block was followed by a batch normalization layer and the ReLU activation function. The convolutions that were used in the encoder had a kernel size of 3 × 3.

The second component was a series of eight identical ResNeXt blocks, with 1024 filters each. In contrast to the blocks that were used in the encoder, the blocks that were included in this component did not change the resolution or the number of filters of their inputs, but instead aimed to obtain refined representations for the feature maps that were retrieved from the encoder. Empirically, we found that the addition of these blocks was beneficial to the model's overall performance.

While the first two components benefited from the use of ResNeXt blocks, we opted for a succession of simple convolutional layers for the decoder, as experimenting with more complex architectural components did not lead to better performance for the forecasting model. Therefore, in our proposed model, the decoder consisted of a series of upsampling layers, followed by convolutions. Following a standard approach to designing architectures for image-to-image tasks, the number of filters was progressively increased in the encoder and decreased in the decoder.
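A structural sketch of this three-component design is given below in PyTorch, reusing the `ResNeXtBlock` sketch from Section 2.3.1; the channel counts and the number of stages are illustrative assumptions, and plain stride-2 convolutions stand in for the downsampling ResNeXt blocks of the actual encoder:

```python
import torch.nn as nn

def upsample_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    # Decoder stage: upsampling followed by a plain convolution.
    return nn.Sequential(
        nn.Upsample(scale_factor=2),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class NeXtNowSketch(nn.Module):
    """A structural sketch of the three-component architecture described
    above; channel counts and stage layout are illustrative, not the
    exact NeXtNow configuration."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.encoder = nn.Sequential(
            # Two classical convolutions providing initial feature maps.
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            # Three downsampling ResNeXt blocks would appear here; plain
            # stride-2 convolutions stand in for them in this sketch.
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 1024, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Eight identical blocks that keep resolution and filter count.
        self.middle = nn.Sequential(*[ResNeXtBlock(1024) for _ in range(8)])
        self.decoder = nn.Sequential(
            upsample_conv(1024, 512),
            upsample_conv(512, 256),
            upsample_conv(256, 128),
            nn.Conv2d(128, out_channels, 3, padding=1),
        )

    def forward(self, x):
        # Input: the k past images concatenated along the channel axis.
        return self.decoder(self.middle(self.encoder(x)))
```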

A schematic representation of the *NeXtNow* architecture is shown in Figure 3.

**Figure 3.** The proposed *NeXtNow* architecture. The ResNeXt blocks are depicted in blue, while the classical convolutional layers are shown in orange. In the case of the ResNeXt blocks, the filters that correspond to the first and last convolutions in the block are shown, while only the number of filters is shown for the plain convolutions. This figure was created using the PlotNeuralNet package [38].

The proposed architecture represented a new purely convolutional approach to weather nowcasting. The main advantage of our model was the simplicity and flexibility of its architecture, which allowed it to be easily adapted to other spatiotemporal prediction tasks with few hyperparameters that needed to be tuned. A limitation of our approach was that it did not incorporate a recurrent component for modeling the time dimension, relying instead on a simple concatenation operation for the time steps. Our model could be extended, however, by including modules from our architecture as feature extractors in recurrent architectures.

The datasets for both case studies (NMA and MET) were split into training, validation and testing subsets based on the total number of days that were available (i.e., 20 days for the NMA dataset and 65 days for the MET dataset): 80% for training, 10% for model validation and the remaining 10% for testing. Within each subset (training/validation/testing), only complete days (with no missing time steps) were used.
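A minimal sketch of such a day-level split is shown below; the random shuffling and the seed are assumptions, since the paper does not state how days were assigned to the subsets:

```python
import random

def split_days(days: list, seed: int = 0):
    """Split a list of (complete) days into 80/10/10 training, validation
    and testing subsets; shuffling and seed are assumptions made here."""
    days = days.copy()
    random.Random(seed).shuffle(days)
    n_train = int(0.8 * len(days))
    n_val = int(0.1 * len(days))
    return (days[:n_train],
            days[n_train:n_train + n_val],
            days[n_train + n_val:])
```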

2.3.4. Performance Evaluation and Testing Methodology

As shown in Section 2.3.3, after the *NeXtNow* model was trained, it was evaluated using the 10% of the instances from the dataset $D$ that were unseen during the training stage.

Various performance metrics were computed to assess the performance of *NeXtNow* using a testing subset. The experiments were repeated three times using three different training–validation–testing splits and the values for each of the performance metrics were averaged over the three runs.

Depending on the type of the input data that is used in a forecasting problem, three types of verification methods can be used for performance evaluation: categorical, continuous (real-valued) or probabilistic approaches. Our experiments used the continuous approach, since we modeled the problem as a regression task and used continuous input data that were mapped onto a continuous output.

The first set of evaluation metrics that we considered used the continuous ground truth data and the continuous forecasts that were made by the *NeXtNow* model. Given a testing dataset with $n$ ground truth data samples, in which each sample was an image containing $m$ points, we denoted the ground truth (observation) value for the $i$-th point in the $t$-th testing instance by $O_{t,i}$ and the prediction (forecast) value for the $i$-th point in the $t$-th testing instance by $F_{t,i}$. The following evaluation metrics, which have been used in the regression literature, were computed for each testing sample [39]:

• *Root mean square error* (*RMSE*), which was computed as the square root of the mean squared error obtained for the $t$-th testing data sample:

$$RMSE(t) = \sqrt{\frac{\sum\_{i=1}^{m} (O\_{t,i} - F\_{t,i})^2}{m}}.$$

Lower *RMSE* values indicated better predictions.

• *Correlation coefficient* (*CC*), which expressed the linear relationship between the forecast and the actual observation (ground truth) and was computed as

$$\text{CC}(t) = \frac{\sum\_{i=1}^{m} (F\_{t,i} - \overline{F}\_t)(O\_{t,i} - \overline{O}\_t)}{\sqrt{\sum\_{i=1}^{m} (F\_{t,i} - \overline{F}\_t)^2} \sqrt{\sum\_{i=1}^{m} (O\_{t,i} - \overline{O}\_t)^2}},$$

where $\overline{O}_t$ represents the average of the actual observations ($\overline{O}_t = \frac{1}{m} \sum_{i=1}^{m} O_{t,i}$) and $\overline{F}_t$ is the average of the forecasts ($\overline{F}_t = \frac{1}{m} \sum_{i=1}^{m} F_{t,i}$). *CC* produced values in [−1, 1], where *CC* = 1 represented a perfect fit between the forecast and the true observation. Higher values of *CC* indicated better predictions.

• The radar reflectivity data included numerous missing points, which corresponded to regions for which the radar did not detect any signal. In the NMA case study, these points were associated with 0 values, while for the MET dataset, which contained negative values, we encoded missing radar reflectivity data using a value of −5, as presented in Section 2.3.2. In order to use common terminology and notations, we refer to these points as zero-labeled points. Since we were not interested in the prediction performance at these points, we also computed variants of the *RMSE* and *CC* performance metrics that were restricted to the non-zero-labeled points (illustrated in the code sketch that follows this list), i.e.:

$$RMSE\_{nz}(t) = \sqrt{\frac{\sum\_{i, O\_{t,i} \neq 0} (O\_{t,i} - F\_{t,i})^2}{n\_z(t)}}$$

where $n_z(t) = |\{i \in \{1, \ldots, m\} \mid O_{t,i} \neq 0\}|$ is the number of non-zero points in testing sample $t$, and

$$\text{CC}\_{nz}(t) = \frac{\sum\_{i, O\_{t,i} \neq 0} (F\_{t,i} - \overline{F}\_t)(O\_{t,i} - \overline{O}\_t)}{\sqrt{\sum\_{i, O\_{t,i} \neq 0} (F\_{t,i} - \overline{F}\_t)^2} \sqrt{\sum\_{i, O\_{t,i} \neq 0} (O\_{t,i} - \overline{O}\_t)^2}}$$

where $\overline{O}_t$ and $\overline{F}_t$ represent the mean observations and forecasts computed across the non-zero points.
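The four per-sample metrics defined above can be sketched in NumPy as follows, with `o` and `f` denoting the flattened observation and forecast images of one testing sample:

```python
import numpy as np

def rmse(o: np.ndarray, f: np.ndarray) -> float:
    """RMSE(t) for one testing sample of m points."""
    return float(np.sqrt(np.mean((o - f) ** 2)))

def cc(o: np.ndarray, f: np.ndarray) -> float:
    """CC(t): Pearson correlation between forecast and observation."""
    fd, od = f - f.mean(), o - o.mean()
    return float((fd * od).sum()
                 / (np.sqrt((fd ** 2).sum()) * np.sqrt((od ** 2).sum())))

def rmse_nz(o: np.ndarray, f: np.ndarray) -> float:
    """RMSE_nz(t): RMSE restricted to the non-zero-labeled points."""
    m = o != 0
    return float(np.sqrt(np.mean((o[m] - f[m]) ** 2)))

def cc_nz(o: np.ndarray, f: np.ndarray) -> float:
    """CC_nz(t): correlation over the non-zero-labeled points, with the
    means also computed over those points."""
    m = o != 0
    fd, od = f[m] - f[m].mean(), o[m] - o[m].mean()
    return float((fd * od).sum()
                 / (np.sqrt((fd ** 2).sum()) * np.sqrt((od ** 2).sum())))
```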

The values that were obtained for all of the testing samples were averaged in order to obtain the final evaluation metrics for the testing subset:

$$RMSE = \frac{\sum\_{t=1}^{n} RMSE(t)}{n}, \quad RMSE\_{nz} = \frac{\sum\_{t=1}^{n} RMSE\_{nz}(t)}{n}, \quad \text{CC} = \frac{\sum\_{t=1}^{n} \text{CC}(t)}{n} \quad \text{and} \quad \text{CC}\_{nz} = \frac{\sum\_{t=1}^{n} \text{CC}\_{nz}(t)}{n}.$$

For a more thorough assessment of *NeXtNow*'s performance, we also discretized its continuous output by applying a threshold, which allowed additional evaluation metrics to be computed. For meteorologists, the classes of the values of the radar products are particularly relevant, for example, for stratiform and convective rainfall classification. By applying a threshold $\tau$ to the continuous output values provided by *NeXtNow*, the set of evaluation metrics was enlarged with the performance metrics used for binary classification: values higher than $\tau$ were considered as belonging to the *positive* class, while values lower than $\tau$ belonged to the *negative* class.

For the testing dataset, after computing the confusion matrix that corresponded to the binary classification task (*TP*, number of true positives; *FP*, number of false positives; *TN*, number of true negatives; *FN*, number of false negatives), the evaluation metrics that are described below were calculated:

• *Critical success index*: $CSI = \frac{TP}{TP + FP + FN}$;
• *False alarm ratio*: $FAR = \frac{FP}{TP + FP}$;
• *Probability of detection*: $POD = \frac{TP}{TP + FN}$;
• *Frequency bias*: $BIAS = \frac{TP + FP}{TP + FN}$.
We note that the *CSI*, *FAR*, *POD* and *BIAS* metrics have been widely used for performance assessment in the forecasting literature. *CSI*, *FAR* and *POD* took values in [0, 1], while the domain of *BIAS* was [0, ∞). Higher values of *CSI* and *POD* and lower *FAR* values were expected for better predictions, while *BIAS* values closer to 1 were expected for better forecasting models.
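A sketch of the thresholding step and the resulting categorical scores is given below; it implements the standard definitions listed above and assumes a non-degenerate confusion matrix (no zero denominators):

```python
import numpy as np

def categorical_scores(o: np.ndarray, f: np.ndarray, tau: float) -> dict:
    """Threshold the continuous observations o and forecasts f at tau
    (values above tau are positive) and compute the categorical metrics
    from the resulting confusion matrix."""
    obs_pos, fc_pos = o > tau, f > tau
    tp = int(np.sum(obs_pos & fc_pos))   # true positives
    fp = int(np.sum(~obs_pos & fc_pos))  # false positives
    fn = int(np.sum(obs_pos & ~fc_pos))  # false negatives
    return {
        "CSI": tp / (tp + fp + fn),
        "FAR": fp / (tp + fp),
        "POD": tp / (tp + fn),
        "BIAS": (tp + fp) / (tp + fn),
    }
```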
