**1. Introduction**

Forecasting the future atmospheric state has mainly been accomplished by numerical weather prediction (NWP), a technology and capability refined over five decades of development and improvement [1]. Data assimilation (DA) has further advanced the ability of NWP to predict future weather accurately by better capturing the initial conditions for NWP and quantifying their uncertainties [2]. In recent years, machine learning (ML), especially neural networks (NN), has been increasingly applied in the atmospheric sciences and has shown great potential [3–6]. Its capability of recognizing patterns in high-dimensional data sets without requiring underlying theoretical equations is appealing and has benefited various research disciplines [7,8].

Various previous studies have applied ML to NWP. One area is postprocessing NWP model outputs to reduce systematic biases. Ref. [9] used logistic regression and random forests to calibrate probabilistic precipitation forecasts and improve verification statistics. Ref. [10] applied machine learning to postprocess NWP outputs in high-impact weather events to further improve forecast skill. Ref. [11] trained a nonlinear NN to predict physical variables such as 2 m temperature and achieved significant improvement over conventional postprocessing methods. Ref. [12] applied deep learning to precipitation nowcasting and 1-hour predictions from radar images. In [13], a deep NN was trained with ensemble weather forecasts for postprocessing, which achieved a relative improvement in ensemble forecast skill of over 14%. Ref. [3] developed a global prediction model using a Fourier Forecasting Neural Network that takes the atmospheric state in the initial conditions and predicts a few 2D variables at future times. Ref. [14] proposed a deep neural network in the form of a Graph Neural Network (GNN) [15] to make forecasts in six-hour increments, trained with the ERA5 dataset. Its forecast performance was shown to outperform HRES, the global high-resolution operational product of the European Centre for Medium-Range Weather Forecasts.

**Citation:** Tian, X.; Conibear, L.; Steward, J. A Neural-Network Based MPAS—Shallow Water Model and Its 4D-Var Data Assimilation System. *Atmosphere* **2023**, *14*, 157. https://doi.org/10.3390/atmos14010157

Academic Editor: Jimy Dudhia

Received: 1 December 2022; Revised: 27 December 2022; Accepted: 5 January 2023; Published: 10 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Hybrid modeling combining ML and NWP has also been explored in numerous recent studies. Ref. [16] investigated the possibility of replacing the longwave radiative transfer model with a NN-based model and achieved an accuracy comparable to the conventional algorithm in a general circulation model. Ref. [17] emulated the longwave radiation parameterization for the National Center for Atmospheric Research (NCAR) Community Atmospheric Model with a NN and produced almost identical results 50–80 times faster. Ref. [18] trained a deep neural network to resolve atmospheric subgrid processes in climate modeling by learning from a multiscale model with explicit convection. As promising as these results are, Ref. [19] pointed out that in hybrid modeling, the feedback between the NN and the General Circulation Model (GCM) can cause instability in simulations and make the experiment crash within days. Similarly, hybrid approaches in DA have been explored in a few studies. Ref. [20] emulated the nonorographic gravity wave drag parameterization with a NN and developed the corresponding tangent linear and adjoint components, which were successfully used in a 4D-Var DA system. Ref. [21] formulated a Lorenz96 model emulator with a NN, generated its Jacobians using the emulator, and applied them in the context of 4D-Var DA.

In this study, we develop a feedforward NN [22] with an input layer, one hidden layer, and an output layer to emulate the global shallow water (SW) dynamics in the Model for Prediction Across Scales (MPAS) framework. We train the model on fluid heights and winds from real atmospheric states and then develop the tangent linear and adjoint models of the trained neural network to formulate a continuous 4D-Var DA system, as described in detail in Section 2. The performance of this DA system in analyzing initial conditions, as well as the resulting forecast improvements, is examined in Section 3. Finally, Section 4 summarizes the study.

#### **2. Methodology**

#### *2.1. MPAS-SW Dynamics*

The SW dynamics under the MPAS spherical centroidal Voronoi tessellation (SCVT) was developed in [23,24]. The forward nonlinear continuous SW dynamics can be written as:

$$\frac{\partial h}{\partial t} + \nabla \cdot (h\mathbf{u}) = 0,\tag{1}$$

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla) \mathbf{u} + f \mathbf{k} \times \mathbf{u} = -g \nabla (h+b) \tag{2}$$

where the fluid height *h* and edge-normal wind **u** are the model prognostic variables, *f* is the Coriolis parameter (a function of latitude *θ*), **k** the local vertical unit vector, *g* the gravitational acceleration, and *b* the bottom topography height. This dynamical relationship has been widely applied in meteorology and oceanography. In this study, the height and wind fields at 500 hPa from ERA5 (European Centre for Medium-Range Weather Forecasts Reanalysis v5) [25] are first interpolated onto a 1000 km resolution mesh consisting of 611 cells (shown in Figure 1a), from which the MPAS-SW model makes forecasts forward in time.

**Figure 1.** (**a**) Spatial distribution of the Spherical Centroidal Voronoi Tessellation (SCVT) mesh at 1000 km with 611 cells globally. (**b**) The neural network diagram showing the structure of the NN-based MPAS-SW model. The actual number of the neurons for the input and output layers is N = 1833, and N = 3666 for the hidden layer.

### *2.2. NN Emulator of MPAS-SW*

A feedforward NN is first formulated to emulate the SW dynamical behaviors reflected in MPAS-SW simulations. The atmospheric state in MPAS-SW is essentially represented by vectors, and the forecasts are also vectors projected from those at a previous time. Similar to the GNN [15], the benefits of such a representation include intrinsic handling of the global spherical structure of the Earth, the ability to resolve the underlying multiscale interactions between cells, and the potential of learning multiresolution models. Thus, densely connected NN layers are chosen in this study. The NN model consists of three layers: an input layer with 1833 neurons, a hidden layer with 3666 neurons, and an output layer with 1833 neurons. The input layer holds three variables, height *h*, zonal wind velocity *u*, and meridional wind velocity *v*, over the global domain in the initial condition, all sampled at the 611 cell centers in the mesh shown in Figure 1a; hence the 1833 (611 × 3) input neurons. The output layer of the same 1833 dimension holds the same three variables (*h*, *u*, and *v*) of the 12-h forecast. The number of neurons in the hidden layer, 3666, was determined heuristically. The hidden layer uses an ELU activation function [26] and 10% dropout [27], where ELU can be denoted as

$$y = \begin{cases} \alpha(e^{x} - 1) & \text{if } x \le 0 \\ x & \text{if } x > 0 \end{cases} \tag{3}$$

where *α* = 1 in this study, making both ELU and its first derivative continuous. The authors also experimented with various other continuous activation functions, including the identity, tanh, sigmoid, and sigmoid linear unit; ELU yielded the best performance in emulating the shallow water dynamics. The structure of the NN model is illustrated in Figure 1b.
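The continuity property that motivates the choice of *α* = 1 in Equation (3) can be checked numerically. A minimal sketch (the function names `elu` and `elu_prime` are illustrative, not from the paper's code):

```python
import math

def elu(x, alpha=1.0):
    """ELU activation, Equation (3): alpha*(e^x - 1) for x <= 0, x otherwise."""
    return alpha * (math.exp(x) - 1.0) if x <= 0 else x

def elu_prime(x, alpha=1.0):
    """Derivative of ELU: alpha*e^x for x <= 0, 1 otherwise."""
    return alpha * math.exp(x) if x <= 0 else 1.0

# With alpha = 1, both the function and its derivative match across x = 0,
# which is what the tangent linear and adjoint models in Section 2.3 rely on.
eps = 1e-9
print(abs(elu(eps) - elu(-eps)))              # ~0: ELU is continuous
print(abs(elu_prime(eps) - elu_prime(-eps)))  # ~0: so is its derivative
```

With *α* ≠ 1 the function itself would remain continuous, but its derivative would jump at zero, like ReLU's does.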

The NN is trained and validated with hourly 500 hPa height *h* and zonal and meridional wind velocities *u* and *v* from the ERA5 dataset over a 60-year period (1959–2019) as features, and the corresponding 12-h MPAS-SW forecasts as targets, totaling 534,697 samples. The training ran for 60 epochs with a learning rate of 3 × 10<sup>−4</sup> using the Adam optimizer and took 0.5 h on one NVIDIA T4 GPU.
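The paper does not specify the ML framework used; as an illustration of the architecture only, a NumPy sketch of the inference-time forward pass is given below (the weights here are random placeholders, not the trained parameters, and dropout is disabled at inference):

```python
import numpy as np

N_IN, N_HID, N_OUT = 1833, 3666, 1833  # 611 cells x 3 variables (h, u, v)

rng = np.random.default_rng(0)
# Placeholder weights/biases; in the paper these result from training on
# ERA5 states (features) and 12-h MPAS-SW forecasts (targets).
W1 = rng.normal(0.0, 0.01, (N_IN, N_HID)); b1 = np.zeros(N_HID)
W2 = rng.normal(0.0, 0.01, (N_HID, N_OUT)); b2 = np.zeros(N_OUT)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def forward(x):
    """State vector (h, u, v at 611 cell centers) in, 12-h forecast out."""
    hidden = elu(x @ W1 + b1)   # 3666-neuron ELU hidden layer
    return hidden @ W2 + b2     # 1833-neuron linear output layer

state0 = rng.normal(0.0, 1.0, N_IN)  # synthetic initial state for illustration
state12 = forward(state0)
print(state12.shape)  # (1833,)
```

The 10% dropout mentioned above applies only during training and is omitted here.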

The variation of the mean square error, used as the loss function, with the epochs is shown in Figure 2. The model was then independently tested on hourly 500 hPa height and wind fields from 2020 and 2021. The root mean squared errors (RMSE) compared with the 12-h MPAS-SW forecasts were 6.32 m and 0.58 m/s in the height and wind fields, respectively. Taking the 500 hPa atmospheric state at 00 UTC on 1 January 2021 (shown in Figure 3a) as the initial conditions, the 12-h forecasts rendered by the NN and MPAS-SW are given in Figure 3b and Figure 3c, respectively. The distribution of the atmospheric wave patterns from the NN visually resembles the MPAS-SW result to some extent. To further demonstrate the NN emulation of the SW dynamics, Figure 4 shows the differences between the 12-h forecasts and the initial conditions for the NN (Figure 4a) and MPAS-SW (Figure 4b). Most of the variations of the 12-h forecasts from the initial conditions are captured in the NN results when compared with the actual MPAS-SW simulations.
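The test statistics quoted above are plain RMSEs between the NN and MPAS-SW forecast fields over the cell centers. For reference, a minimal sketch (the function name is illustrative):

```python
import numpy as np

def rmse(pred, ref):
    """Root mean squared error between two forecast fields,
    e.g. NN vs. MPAS-SW height at the 611 cell centers."""
    pred, ref = np.asarray(pred, dtype=float), np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

# e.g. rmse(nn_height_12h, mpas_sw_height_12h) over the test period
print(rmse([0.0, 0.0], [3.0, 4.0]))  # sqrt((9 + 16)/2) = 3.5355...
```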

**Figure 2.** Variations of the loss function with respect to the epochs in the NN training.

**Figure 3.** (**a**) The spatial distribution of the fields of height (shaded) and wind (vectors) at 00 UTC 1 January 2021. (**b**,**c**) The 12-h forecasts made by (**b**) the NN-based SW model and (**c**) MPAS-SW model.

**Figure 4.** The differences in height (shaded) and wind (vectors) between the 12-h forecasts by (**a**) NN-based SW and (**b**) MPAS-SW with respect to the initial conditions.

#### *2.3. The Tangent Linear and Adjoint Models*

A nonlinear forward forecast model can be denoted as:

$$\mathbf{x}(t_r) = \mathcal{M}(\mathbf{x}(t_0))\tag{4}$$

It takes the model state **x**(*t*<sub>0</sub>) at time *t*<sub>0</sub> as the initial conditions and predicts the model state **x**(*t<sub>r</sub>*) at time *t<sub>r</sub>*. The corresponding tangent linear model is then:

$$\Delta \mathbf{x}(t_r) = \mathbf{M}(\mathbf{x}(t_0)) \Delta \mathbf{x}(t_0) = \frac{\partial \mathcal{M}(\mathbf{x}(t_0))}{\partial \mathbf{x}} \Delta \mathbf{x}(t_0). \tag{5}$$

The tangent linear model predicts the perturbation distributions forward in time following the nonlinear trajectory given by the nonlinear forward model in Equation (4). The adjoint model is simply the transpose of the tangent linear model as follows [28,29]:

$$\Delta \hat{\mathbf{x}}(t_0) = \mathbf{M}^T(\mathbf{x}(t_0)) \Delta \hat{\mathbf{x}}(t_r) \tag{6}$$

The adjoint model simulates backward in time along the nonlinear trajectory and yields the sensitivity of a user-specified response function at time *t*, where *t* > *t*<sub>0</sub>, to the initial conditions at *t*<sub>0</sub>. When prediction models simulate complex behaviors, the tangent linear and adjoint models are developed at the source code level, line by line. When the adjoint model is applied in a 4D-Var data assimilation system, the simulation propagated forward in time by the nonlinear forecast model is first compared with the existing observations at the observation time. The discrepancy, as a sensitivity or forcing term, is then taken by the adjoint model and propagated backward in time in a dynamically consistent manner to the model initial time, informing how the initial conditions should be adjusted so that the simulation at the observation time agrees more closely with the observations. The same forward-backward implementation is repeated over multiple iterations until an optimal solution is found, which will be discussed further in the next subsection.
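This forward-backward loop can be made concrete with a toy example. The sketch below (not the paper's system; all names and sizes are illustrative) takes a small linear forecast model **M** and minimizes the misfit J(**x**<sub>0</sub>) = ½‖**Mx**<sub>0</sub> − **y**‖² by gradient descent, where each iteration is one forward sweep, one comparison with the observation, and one adjoint sweep:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
# Well-conditioned toy linear forecast model and a synthetic observation.
M = 0.5 * np.eye(n) + 0.05 * rng.normal(size=(n, n))
y = rng.normal(size=n)     # observation valid at the forecast time
x0 = np.zeros(n)           # first-guess initial conditions

# Gradient of J(x0) = 0.5*||M x0 - y||^2 is M^T (M x0 - y):
# forward run -> misfit at observation time -> adjoint run back to t0.
for _ in range(500):
    xr = M @ x0            # forward sweep with the forecast model
    forcing = xr - y       # discrepancy at the observation time
    grad = M.T @ forcing   # adjoint sweep propagates the forcing back to t0
    x0 -= 1.0 * grad       # adjust the initial conditions

print(np.linalg.norm(M @ x0 - y))  # near zero after convergence
```

In the real 4D-Var system the descent direction is handed to a minimization algorithm rather than a fixed-step update, but the forward/adjoint structure is the same.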

In the case of densely connected neural networks, an individual layer in the forward model can be rewritten as:

$$\mathbf{y} = F(\mathbf{x}\mathbf{W} + \mathbf{b})\tag{7}$$

where **x** is the input of the layer, **W** and **b** are the weights and biases resulting from the training, and *F* denotes the activation function. The tangent linear model in this case becomes:

$$\delta \mathbf{y} = \frac{\partial F}{\partial (\mathbf{x} \mathbf{W} + \mathbf{b})} \left( \delta \mathbf{x} \, \mathbf{W} \right) \tag{8}$$

The adjoint model is then:

$$\delta \hat{\mathbf{x}} = \delta \hat{\mathbf{y}} \left[ \frac{\partial F}{\partial (\mathbf{x} \mathbf{W} + \mathbf{b})} \right]^T \mathbf{W}^T \tag{9}$$

A multilayer or deep NN is simply a repetition of the layer described above. In this study, the hidden layer has an ELU activation function while the output layer has a linear activation function; for the linear output layer, Equations (7)–(9) reduce to:

$$\mathbf{y} = \mathbf{x}\mathbf{W} + \mathbf{b} \tag{10}$$

$$
\delta \mathbf{y} = \delta \mathbf{x} \mathbf{W} \tag{11}
$$

$$
\delta \hat{\mathbf{x}} = \delta \hat{\mathbf{y}} \mathbf{W}^T \tag{12}
$$
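A standard way to verify that an adjoint such as Equations (9) and (12) is consistent with its tangent linear model is the dot-product (adjoint) test, ⟨**M**δ**x**, δ**ŷ**⟩ = ⟨δ**x**, **M**<sup>T</sup>δ**ŷ**⟩. A NumPy sketch for a single ELU layer follows (random placeholder weights; function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_out = 5, 7
W = rng.normal(size=(n_in, n_out))
b = rng.normal(size=n_out)
x = rng.normal(size=n_in)      # linearization point along the nonlinear trajectory

def elu_prime(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

d = elu_prime(x @ W + b)       # dF/d(xW + b) evaluated on the trajectory

def tlm(dx):
    """Tangent linear of y = ELU(xW + b), Equation (8)."""
    return d * (dx @ W)

def adj(dy_hat):
    """Adjoint of the layer: elementwise derivative, then W^T, Equation (9)."""
    return (dy_hat * d) @ W.T

# Dot-product test: <tlm(dx), dy_hat> must equal <dx, adj(dy_hat)>
# to machine precision for any perturbation pair.
dx, dy_hat = rng.normal(size=n_in), rng.normal(size=n_out)
print(np.dot(tlm(dx), dy_hat) - np.dot(dx, adj(dy_hat)))  # ~0
```

For the linear output layer, `d` is identically one and the same test reduces to Equations (11) and (12).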

It should be noted that, when a variational data assimilation system is to be developed, the activation function of choice should be continuous and smooth, such as the ELU defined in Equation (3). A discontinuity in the activation function or its derivative, or in the model in general (as with the Rectified Linear Unit, whose derivative jumps at zero), invalidates the tangent linear approximation and can cause the data assimilation system to converge with difficulty, or fail to converge, when searching for the optimal solution.
