*Article* **Observation Strategies Based on Singular Value Decomposition for Ocean Analysis and Forecast**

**Maria Fattorini 1,2 and Carlo Brandini 1,2,\***


Received: 31 October 2020; Accepted: 3 December 2020; Published: 8 December 2020

**Abstract:** In this article, we discuss possible observing strategies for a simplified ocean model (Double Gyre (DG)), used as a preliminary tool to understand the observation needs for real analysis and forecasting systems. Observations are indeed fundamental to improve the quality of forecasts when data assimilation techniques are employed to obtain reliable analysis results. In addition, observation networks, particularly in situ observations, are expensive and require careful positioning of instruments. A possible strategy to locate observations is based on Singular Value Decomposition (SVD). SVD has many advantages when a variational assimilation method such as the 4D-Var is available, with its computation being dependent on the tangent linear and adjoint models. SVD is adopted as a method to identify areas where maximum error growth occurs and assimilating observations can give particular advantages. However, an SVD-based observation positioning strategy may not be optimal; thus, we introduce other criteria based on the correlation between points, as the information observed on neighboring locations can be redundant. These criteria are easily replicable in practical applications, as they require rather standard studies to obtain prior information.

**Keywords:** singular value decomposition; data assimilation; ocean models; observation strategies; ocean forecasting systems; ocean Double Gyre; 4D-Var; ROMS

#### **1. Introduction**

In recent years, there has been a growing demand for oceanographic forecast data [1,2], coming from different public and private stakeholders for operational oceanography purposes. This demand stimulates the production of reliable predictions of physical and biogeochemical ocean variables to support activities such as search and rescue operations, ocean energy, fisheries, and environmental monitoring and pollution control. Observations play an essential role in operational systems, as they also allow evaluating the reliability of predictions. The most important initiative in this context is the Global Ocean Observing System (GOOS), which includes several regional observation components providing data to global and basin-scale operational services, as well as to regional downstream services [3].

Operational oceanographic services at both the basin and regional scales improve their forecast reliability when the model forecast is properly initialized with fields obtained through a data assimilation procedure. Data assimilation (DA) combines observations and the model first guess, weighted by their respective accuracies, to obtain the best unbiased estimate of the ocean state. In a DA scheme, the observations correct the trajectory (first guess) according to their influence, which mainly depends on the observation and model error covariance matrices. As a consequence, DA can help control the error growth of the state trajectory with respect to the real evolution. Furthermore, in operational practice, the common procedure of initializing a simulation from external data (e.g., climatology, objective analysis, model analysis) requires a spin-up interval during which the solution is not usable. DA schemes such as 4D-Var reduce spin-up effects (keeping dynamical consistency between the analysis and the model equations) and also reduce model uncertainties.

Large amounts of data come from satellite observations (mainly Sea Surface Temperature and Sea Surface Height), which have some intrinsic limitations (surface-limited observations, revisiting times). Many parameters are only observable by collecting in situ observations through specific sensor networks that integrate satellite observations with data along the water column and at higher frequencies. The main limitation of in situ observation networks is their high cost for installation and maintenance over time; it is very important, therefore, to design an in situ observing system that maximizes the impact of the observations in the forecast, minimizing the cost.

Ocean models can also be used to evaluate both existing and new observing networks through different methodologies [4]. Observing System Experiments (OSEs) compare the analysis obtained by eliminating part of the observations with the analysis obtained by assimilating the entire dataset, to understand the impact of the omitted observations. Observing System Simulation Experiments (OSSEs) use "synthetic" observations to evaluate the benefit of assimilating observations from instruments/networks not yet installed. Adjoint-based techniques and ensemble-based methods can be used to study observation sensitivities and their impact on assimilated systems, contributing to the design of observing systems [4–6].

As different observations have different impacts when they are assimilated in an ocean model, a major problem is designing an observation network that provides data giving the best results (i.e., fewer errors) when they are assimilated. The positioning of the observing system is indeed related to the unstable modes, which deserve more than others to be corrected. Since Lorenz's fundamental milestone of 1965 [7], it has been well known that the divergence in chaotic systems arises from the unstable directions of the state trajectory, where small errors in initial conditions grow significantly, leading to very different final states. This places a time limit on the predictability of the system state, which is usually evaluated by the largest Lyapunov exponents. The assimilation of observations attempts to extend that time limit [8]. Some large errors may decay over time, whereas smaller errors may grow intensely and heavily impact forecast reliability. The divergence between the model evolution and the real state of the system is driven by these unstable directions, rather than by the largest components of the error embedded in the predicting system [9,10]. Indeed, the structure of the fast-growing perturbations is a dynamically evolving field and depends on the flow regime, as it derives from the position of the current state on the attractor and varies over time [7,11,12].

For what concerns observation strategies, we can expect that a suitable positioning of observation devices in areas in which error in the initial condition is fast-growing may better control this growth. Such a choice can be performed on the basis of the study of perturbation growth. Hansen and Smith [13] showed that for sufficiently small errors, observation strategies based on system dynamics produce better results than strategies based on error estimates.

A notable contribution to this field was made by Farrell and Ioannou [14,15] in their Generalized Stability Theory (GST) of dynamical systems, in which they extended the traditional modal stability theory to include transient growth processes. The authors identified Singular Value Decomposition (SVD) as a suitable tool for treating perturbation growth in geophysical fluid systems. A variation of this method considers the calculation of Hessian singular vectors, which identify the errors in initial conditions that evolve into the largest forecast errors [6].

The existence of large singular values indicates that small errors in the initial background state can grow very rapidly, reducing the system predictability, and the respective singular vectors indicate the areas where disturbances grow faster. Analyzing Singular Vectors (SVs) appears strategic to increase model predictability by giving an indication of where it is more important to reduce errors in initial conditions [6,16].

The application of SVD to select observations has already been tested in a number of studies, mostly related to operational aspects of atmospheric forecasting systems [5,17–21]. A review of observation strategies [22] confirmed the utility of SVD information in choosing the observations to be assimilated. Bergot et al. [23] considered SVD a useful tool to identify areas where assimilating even a few observations can significantly reduce forecast errors. An important portion of the literature on adaptive observations originated from the first experimental reference, the so-called Fronts and Atlantic Storm Track Experiment (FASTEX) [24–26]. Other experiments were carried out by assimilating additional observations from aircraft in regions characterized by rapidly growing SVs [19,27,28].

In this work, we test some possible observation strategies for a simplified ocean system, with the aim of establishing an optimal configuration of an in situ observing network able to reduce forecasting uncertainties through DA. For simplicity, we limit this study to velocity observations. In particular, our goal is to achieve such an optimal configuration using a limited number of observation points, as these may have a significant cost, in order to ensure the greatest possible benefit to an integrated assimilation/forecasting system. The benefit is measured with respect to the short-/medium-range forecast (analysis and forecast cycles of a few days), as required in operational practice. To select the possible observation points, the proposed strategy is based on the SVD and on a maximum correlation among the horizontal velocities, which translates into a minimum distance, variable over the model domain. Indeed, we verify that an observation strategy based on SVD alone may fail if it is not accompanied by other considerations linked to the flow structure, and that combining the SVD analysis with a correlation analysis can limit redundant observations.

The experiments are carried out by using the ocean model ROMS (Regional Ocean Modelling System, www.myroms.org) [29], which already includes suitable routines for the SVD computation [30]. We perform a set of numerical experiments assimilating different datasets and investigate the effects on model results.

In Section 2, the configuration of the experiment is presented: in particular, the description of the model, the DA scheme, and the proposed strategy to place in situ observation points.

In Section 3, the results of all experiments are reported; our best strategy is compared to a random localization strategy and also to a selection procedure based on SVD and on a minimum fixed distance among observations.

#### **2. Materials and Methods**

#### *2.1. The Model Set-Up*

The reference model used in this work is an idealized ocean model, widely known as the Double Gyre (DG). Although it is conceptually much simpler than realistic simulations, it is still relevant from an oceanographic point of view. The simplified DG dynamics have been used as an idealized case to reproduce the seasonal and interannual oscillations of the large-scale ocean circulation, useful for studying climate system predictability [30].

This configuration is also used by the ROMS developers [31] as a test case to introduce the functionalities of the tangent linear model and its adjoint. This simplified ocean system is strongly barotropic, and no relevant differences can be found in different vertical layers. The basin has a flat bathymetry 500 m deep with all closed boundaries. The model domain is a large rectangular basin 1000 km in size in the east–west direction and 2000 km in the north–south direction; it is discretized horizontally in 56 × 110 cells and vertically in 4 equally spaced s-levels. The model is forced by a constant zonally uniform wind stress with a positive zonal component at its midaxis, which is inverted to a negative zonal component approaching the northern and southern boundaries, as defined by a sinusoidal function of latitude:

$$\tau_x = -\tau_0 \cos\left(\frac{2\pi y}{L_y}\right) \tag{1}$$

where τ<sub>0</sub> = 0.05 N/m<sup>2</sup> and *Ly* is the meridional extent of the basin. This particular wind distribution induces two large interacting vortices with a scale of about 1000 km: a subpolar cyclonic gyre and a subtropical anticyclonic gyre, whose stationarity depends on the eddy viscosity value.
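The wind profile of Equation (1) can be sketched numerically; the sketch below assumes the meridional coordinate y runs from 0 to Ly, with the values given in the text (τ0 = 0.05 N/m², Ly = 2000 km).

```python
import numpy as np

# Zonal wind stress profile of Equation (1); the y discretization (111
# points, one per grid row) is our illustrative choice.
tau0 = 0.05      # wind stress amplitude (N/m^2)
Ly = 2000e3      # meridional extent of the basin (m)

y = np.linspace(0.0, Ly, 111)
tau_x = -tau0 * np.cos(2.0 * np.pi * y / Ly)

# At the boundaries (y = 0, Ly), cos = 1, so tau_x = -tau0 (westward);
# at the mid-axis (y = Ly/2), cos = -1, so tau_x = +tau0 (eastward),
# which drives the cyclonic/anticyclonic gyre pair.
```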

The model solves the 3D primitive equations in a beta-plane approximation centered at 45° N. Density profiles are defined by an analytically stable profile of the active tracers, as described by Moore et al. [31]. Advection of both 3D momentum and tracers is implemented by a third-order upstream scheme in the horizontal (along constant s-surfaces) and a fourth-order centered scheme in the vertical. Horizontal turbulent processes are parametrized by a harmonic operator with horizontal eddy viscosity and diffusivity equal to 160 m<sup>2</sup>/s. Although forced by a steady wind, the model solution depends on the eddy viscosity value, and the circulation passes from a stationary pattern (high eddy viscosity) to an oscillating behavior (low eddy viscosity), with the formation of smaller-scale vortices and the shifting of the main current patterns. In fact, the DG circulation shows a bifurcation [32] at a critical value of the Reynolds number. For values lower than the critical value, the flow converges quickly to a unique steady solution, whereas for values above it, instabilities occur and the symmetry between the subpolar and subtropical cells is broken. Vertical turbulent mixing is parameterized by vertical viscosity and diffusion coefficients of 1 m<sup>2</sup>/s; linear bottom friction is introduced with a bottom drag coefficient of 8 × 10<sup>−7</sup> m/s. To ensure model stability, the numerical time steps are set to 45 s (barotropic) and 900 s (baroclinic).

The model is initially started from the steady solution (Figure 1a), obtained by a 20-year-long simulation with a high eddy viscosity of 1280 m<sup>2</sup>/s, which ensures a steady circulation symmetrical with respect to the zonal axis. For the following 10-year run, the eddy viscosity is then decreased to 160 m<sup>2</sup>/s: in this case, the circulation loses its symmetry, becoming unsteady and characterized by meandering, with the two original gyres moving through the domain and other gyres arising.

**Figure 1.** Surface current field (arrows) and Sea Surface Height (SSH) (m) of (**a**) the steady solution, (**b**) the climatological state, and (**c**) the initial state of the Nature Run for the first assimilation window.

For our model experiments, we define two independent model runs:

1. A Nature Run (NR), which is the model initialized by the last snapshot of the 10-year run (Figure 1c). This is considered the true state of the ocean, a virtual reality from which synthetic observations are extracted;

2. A Free Run (FR) or background, which is the simulation initialized by an initial velocity field obtained by averaging the unsteady solution of the 10-year run (Figure 1b). This represents the common and simplest way to initialize numerical simulations by means of a "climatological" time-averaged solution.

In each experiment, a set of observations is extracted from the NR using the observation strategies discussed below. Such a synthetic dataset includes velocity observations in the form of ocean current profiles at a frequency of 15 min, which are assimilated using the FR as the background, thus producing an analysis. In real ocean-observing systems, such kinds of observations are normally taken by means of Acoustic Doppler Current Profilers (ADCPs). The analysis is performed in a 5-day assimilation window that provides initial conditions to a subsequent 5-day forecast.

The positions of observation instruments are fixed in each experiment and chosen by adopting three different strategies: (1) randomly; (2) in the area of maximum dominant singular vectors, imposing a minimum distance among the positions; and (3) in the area of maximum dominant singular vectors, imposing a maximum limit on the correlation between the velocity time series at each position. For the first set of experiments (random positioning), the positions are obtained by the rand function in MATLAB©, assuming a minimum distance equal to the grid cell size (in this case, around 18 km). To reduce the dependence of the results on a particular random spatial configuration, we repeat each experiment by assimilating different datasets characterized by the same number of (virtual) observation instruments. Finally, to assess the selection procedure under different hydrodynamic states, the assimilation test is repeated in different time windows.
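The random-placement strategy can be sketched as follows; the paper uses MATLAB's rand, while this is a hypothetical Python equivalent based on rejection sampling, with the grid size and minimum distance taken from the text.

```python
import numpy as np

def random_positions(n, nx, ny, cell_km=18.0, min_dist_km=18.0, seed=0):
    """Draw n instrument positions on an nx-by-ny grid by rejection
    sampling, keeping only candidates at least min_dist_km apart.
    Illustrative stand-in for the random-positioning strategy."""
    rng = np.random.default_rng(seed)
    chosen = []
    while len(chosen) < n:
        cand = rng.integers(0, [nx, ny])          # random grid cell (i, j)
        ok = all(np.hypot(*(cand - p)) * cell_km >= min_dist_km
                 for p in chosen)
        if ok:
            chosen.append(cand)
    return np.array(chosen)

# 20 virtual ADCPs on the 56 x 110 Double Gyre grid
pos = random_positions(20, nx=56, ny=110)
```

Repeating the call with different seeds yields the different random network configurations that form each ensemble.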

The assimilation of synthetic observations is executed through the incremental formulation of the 4D-Var based on the Lanczos algorithm (ROMS-IS4DVar). In this procedure, the increments added to the control vector to minimize the cost function are computed iteratively [5]. The control vector corresponds to the initial state, so only the initial conditions are adjusted by assimilating data.

A maximum of 10 inner loops and 2 outer loops is set, and the assimilation process stops when the norm of the cost function gradient reaches a given tolerance, which we set to 10<sup>−4</sup>, having verified that this limit is adequate for convergence. Observation errors are accounted for in the observation error matrix, as we implicitly assume that observations are affected by an error of 0.01 m/s.
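The inner-loop minimization can be illustrated on a toy quadratic problem: the incremental cost J(δx) = ½ δxᵀB⁻¹δx + ½ (Gδx − d)ᵀR⁻¹(Gδx − d) is minimized by a conjugate-gradient iteration (the Lanczos-based minimization in ROMS-IS4DVar belongs to the same family). The matrices below are small random stand-ins, not ROMS operators; only the 0.01 m/s observation error and the gradient tolerance are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 5                      # toy state / observation dimensions
B = np.eye(n) * 0.5              # background error covariance (stand-in)
R = np.eye(m) * 0.01**2          # observation error covariance (0.01 m/s)
G = rng.standard_normal((m, n))  # tangent-linear observation operator
d = rng.standard_normal(m)       # innovation vector y - H(x_b)

A = np.linalg.inv(B) + G.T @ np.linalg.inv(R) @ G   # Hessian of J
b = G.T @ np.linalg.inv(R) @ d                      # minus gradient at dx = 0

# Conjugate-gradient "inner loops" on A dx = b
dx = np.zeros(n)
r = b - A @ dx
p = r.copy()
for _ in range(50):
    if np.linalg.norm(r) < 1e-4:                    # gradient tolerance
        break
    alpha = (r @ r) / (p @ A @ p)
    dx += alpha * p
    r_new = r - alpha * (A @ p)
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
# dx is the analysis increment added to the initial condition
```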

The background error covariance matrix *Bx* is factorized by means of the univariate correlation matrix *C*, the diagonal matrix of the prior error standard deviations Σ*x*, and the multivariate balance operator *Kb*, as described in [5]:

$$B_x = K_b \, \Sigma_x \, C \, \Sigma_x^T \, K_b^T. \tag{2}$$

The background error standard deviation Σ*<sup>x</sup>* is defined by using the standard deviation of the state variable fields during the 10-year-long (unsteady) simulation, a period long enough to compute meaningful circulation statistics.

The correlation matrix *C* is in turn factorized by a diagonal matrix of grid box volumes *W*, a horizontal and vertical correlation function model *Lh* and *Lv*, and a matrix of normalization coefficients Λ, as:

$$C = \Lambda \, L_v^{1/2} L_h^{1/2} W^{-1} L_h^{T/2} L_v^{T/2} \, \Lambda^T. \tag{3}$$

The normalization coefficients ensure that the diagonal elements of the associated correlation matrix *C* are equal to unity; they are computed through the so-called "exact method," with horizontal and vertical isotropic decorrelation scales set to 30 km and 100 m, respectively, for all state variables.
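A minimal 1-D sketch of the factorization of Equation (3): a Gaussian kernel stands in for the correlation function model, a diagonal matrix for the grid "volumes", and the normalization coefficients are computed directly (the "exact method") so that the diagonal of C equals 1. Sizes and scales are illustrative, except for the 30 km horizontal decorrelation scale taken from the text.

```python
import numpy as np

n = 50
dx = 10.0                       # grid spacing (km), illustrative
scale = 30.0                    # horizontal decorrelation scale (km)
x = np.arange(n) * dx

# Gaussian correlation kernel standing in for the L_h^{1/2} factor
Lh = np.exp(-0.5 * ((x[:, None] - x[None, :]) / scale) ** 2)
W = np.eye(n) * dx              # grid "volume" matrix

M = Lh @ np.linalg.inv(W) @ Lh.T          # unnormalized correlation operator
lam = 1.0 / np.sqrt(np.diag(M))           # exact normalization coefficients
C = np.diag(lam) @ M @ np.diag(lam)       # normalized correlation matrix
# diag(C) is now exactly 1, as required for a correlation matrix
```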

#### *2.2. Observation Strategies*

The proposed criteria for identifying the most suitable positions to install observation instruments combine two requirements: (a) finding areas characterized by high values of the dominant SVs; (b) imposing a maximum limit on the correlation between the velocity time series at the observation points.

For what concerns the first requirement, we resort to the Singular Value Decomposition (SVD) of the tangent propagator to identify the directions of maximum error growth in a given time interval, according to the Generalized Stability Theory (GST) by Farrell and Ioannou [14,15] for nonautonomous systems. SVD decomposes the matrix L (here, the tangent propagator) into matrices that satisfy LV = US [33]. The matrices V and U are formed by the eigenvectors of L<sup>T</sup>L and LL<sup>T</sup>, respectively, which are symmetric by construction. Their eigenvectors therefore form two orthogonal bases, of the domain space and of the range space, respectively. Thanks to this property, any disturbance at the initial time can be written as a linear combination of the initial singular vectors in the domain space, whereas its evolution can be written as a linear combination of the final singular vectors in the range space. Furthermore, L<sup>T</sup>L and LL<sup>T</sup> share the same eigenvalues, called singular values once their square roots are taken; each singular value corresponds to the growth factor of the respective initial singular vector as it is transformed by L. Therefore, the growth rate of any perturbation is bounded by the fastest-growing singular mode, characterized by the highest singular value. The dominant initial singular vector defines the direction of maximum error growth in the interval [t1, t2]. This calculation allows us to assess the fastest-growing disturbance among all those possible during a given finite time interval, called the optimization time. The SVD computation is controlled by two main parameters: the norm by which the growth of the singular vectors is evaluated, and the optimization time. The dominant initial singular vector is the one that maximizes the chosen norm at the final time of the interval, and adopting different norms leads to different SVs.
In most oceanographic studies, the norm used is the total energy, which is also the one we apply in this work, although other norms have been used depending on the particular aim, such as kinetic energy, enstrophy, and the Hessian norm [5,6,17,33,34]. More details on the SVD can be found in Farrell and Ioannou [14,15].
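The GST properties described above can be verified on a small stand-in for the tangent propagator: the SVD gives orthonormal initial (V) and final (U) singular vectors, the dominant initial vector evolves into the dominant final vector scaled by the leading singular value, and no unit perturbation can grow more. The random matrix below only illustrates the algebra, not an actual propagator.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.standard_normal((30, 30))   # stand-in for the tangent propagator

U, s, Vt = np.linalg.svd(L)         # L V = U S, with s in descending order
v1 = Vt[0]                          # dominant initial singular vector
u1 = U[:, 0]                        # dominant final singular vector

# v1 evolves into s[0] * u1 under the action of L...
evolved = L @ v1
assert np.allclose(evolved, s[0] * u1)

# ...and the leading singular value bounds the growth of any unit perturbation
dp = rng.standard_normal(30)
dp /= np.linalg.norm(dp)
assert np.linalg.norm(L @ dp) <= s[0] + 1e-12
```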

The number of singular modes of the system is of the order of 10<sup>5</sup>, equal to the dimension of the model state. The leading singular vectors are computed in ROMS through the Lanczos algorithm [35] by integrating the tangent linear model forward in time and the adjoint model backward in time as many times as needed for the convergence of the algorithm [31].
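The matrix-free idea behind this computation can be sketched with SciPy's Lanczos-type solver: the leading singular triplets are obtained from repeated applications of the forward (matvec) and adjoint (rmatvec) operators, without ever forming the propagator explicitly. Here a small random matrix stands in for the propagator; in ROMS, the two callbacks would be full tangent-linear and adjoint model integrations.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

rng = np.random.default_rng(0)
L = rng.standard_normal((200, 200))   # stand-in for the tangent propagator

op = LinearOperator(
    shape=L.shape,
    matvec=lambda x: L @ x,       # tangent-linear model, forward in time
    rmatvec=lambda y: L.T @ y,    # adjoint model, backward in time
    dtype=np.float64,
)
U, s, Vt = svds(op, k=10)         # 10 dominant singular triplets
s = s[::-1]                       # svds returns ascending order

# The matrix-free result agrees with the full decomposition
s_full = np.linalg.svd(L, compute_uv=False)
assert np.allclose(s, s_full[:10])
```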

The SVD is applied to the tangent model of the free-run circulation starting from the climatological state, so it is the same for all time windows. The 200 dominant SVs are computed for different optimization times (5, 10, 20, and 60 days). In Figure 2, the singular values computed with optimization times of 5 and 60 days are reported as examples; they show that limiting the calculation to the first 200 SVs, which contain the main information on the perturbation evolution, is adequate, as it also avoids the computation of a huge number of SVs.

**Figure 2.** The 200 dominant singular values computed on an optimization time of 5 days (**a**) and an optimization time of 60 days (**b**).

Figure 3 shows the sum of the projections onto the horizontal surface velocity components of all the dominant initial singular vectors (upper maps) and the dominant final singular vectors (bottom maps), weighted by the corresponding singular values. SVs and singular values are computed from the free run by considering different optimization times (T<sub>op</sub>) of 5, 10, 20, and 60 days. The structures of the initial and final singular vectors computed over a short T<sub>op</sub> are similar to each other and concentrated near the convergence of the currents along the western side, where the two branches of the currents merge and change direction (from meridional to zonal).

**Figure 3.** Sum of projections on the horizontal surface velocities of the 200 dominant initial singular vectors, for different optimization times (**a**–**d**). Similar sum of projections computed for the first 200 final singular vectors (**e**–**h**). Singular values are used to weight Singular Vectors (SVs).

The difference between the initial and final singular vector maps grows for increasing optimization times, as do the singular values (Figure 2). The perturbations computed for 60 days evolve much more than those computed over a shorter period, in agreement with the higher singular values obtained for larger optimization times. However, the significantly large singular values obtained even for the 5-day optimization window are due to the fact that the initial period of the background run is subject to a strong tendency to change the circulation structure, especially close to the convergence area. Indeed, the simulation starts from the climatological state (Figure 1b), and the circulation undergoes strong variations concentrated in the central–western area.

Concerning the second requirement, we set a minimum distance among observations, defined on the basis of a maximum correlation between the time series of model velocities at the observation positions themselves.

The need to impose a minimum distance among in situ observations arises from the typical structure of the dominant singular vectors, which are quite concentrated in specific areas. The highest values of the SV component on the surface velocity are strictly localized in the middle of the western boundary, where the ascending and descending currents converge (Figure 3). On the one hand, it is crucial to control such areas of maximum error growth with sufficient detail. On the other hand, having too many observation points concentrated in this part would not give a strong benefit to the analysis, since the horizontal velocities in this area can be strongly correlated with each other. Therefore, a procedure based only on SVs would lead to an excessively dense instrument positioning. Hence, some experiments are realized imposing a fixed minimum distance between the observation points, but this method does not avoid choosing excessively correlated sampling points. Moreover, for most hydrodynamic fields of interest, we expect the correlation between points not to be homogeneous over the whole domain, and neither, consequently, is the suitable minimum distance among observations. This correlation is computed by the following steps:

(1) For each grid cell (*i*, *j*), we compute the spatial correlation of both the u and v velocity components with respect to all other grid cells (*h*, *l*), with (*i*, *j*) ≠ (*h*, *l*), and then we take the norm of the matrix:

$$\left\| \begin{array}{cc} \langle u_{ij}\, u_{hl} \rangle & \langle v_{ij}\, u_{hl} \rangle \\ \langle u_{ij}\, v_{hl} \rangle & \langle v_{ij}\, v_{hl} \rangle \end{array} \right\| . \tag{4}$$

Each cell is associated with a single normalized correlation map, whose values range between 0 (uncorrelated grid cells) and 1 (perfectly correlated grid cells).

(2) A maximum value for such normalized correlation is imposed as a threshold, used to compute for each grid cell the average distance beyond which the correlation falls below the chosen threshold. In this way, all the correlation maps can be transformed into a single map of distances. This map of minimum distances among in situ observation points is computed as the mean radius inside which the correlation of the variable of interest is higher than the imposed threshold.
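The two steps above can be sketched as follows, on synthetic velocity time series (smoothed random noise) that only illustrate the procedure; the grid size, cell size, and the 0.6 threshold of the next paragraph are the only values tied to the text.

```python
import numpy as np

rng = np.random.default_rng(0)
nt, ny, nx = 200, 20, 20
u = rng.standard_normal((nt, ny, nx)).cumsum(axis=0)   # toy u time series
v = rng.standard_normal((nt, ny, nx)).cumsum(axis=0)   # toy v time series

def corr(a, b):
    """Temporal correlation coefficient between two time series."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return (a * b).mean()

def correlation_map(i, j):
    """Step (1): for reference cell (i, j), the norm of the 2x2 matrix of
    u/v cross-correlations (Equation (4)) with every other cell, scaled so
    that the reference cell equals 1."""
    m = np.zeros((ny, nx))
    for h in range(ny):
        for l in range(nx):
            M = np.array([[corr(u[:, i, j], u[:, h, l]), corr(v[:, i, j], u[:, h, l])],
                          [corr(u[:, i, j], v[:, h, l]), corr(v[:, i, j], v[:, h, l])]])
            m[h, l] = np.linalg.norm(M, 2)
    return m / m[i, j]

def correlation_distance(i, j, threshold=0.6, cell_km=18.0):
    """Step (2): mean radius (km) of the region still correlated with
    cell (i, j) above the threshold."""
    cmap = correlation_map(i, j)
    hh, ll = np.nonzero(cmap > threshold)
    return np.mean(np.hypot(hh - i, ll - j)) * cell_km

d = correlation_distance(10, 10)
```

Repeating the last call for every grid cell yields the map of minimum distances of Figure 4.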

Figure 4 shows the map of distances used in this study (in km), computed by imposing a maximum correlation of the time series of velocities equal to 0.6. Note that distances are never too small (in this case, >100 km) and they increase moving away from the critical area of convergence.

**Figure 4.** Map of the correlation distances (km) related to a correlation threshold of 0.6.

#### **3. Results**

As described in the previous sections, we test three different observation strategies applied to an idealized ocean model (DG) to identify the observation network configuration that gives rise to the best analysis and forecast.

To compare the tests, we adopt the Taylor diagram representation [36], which is often used in the operational field and allows collecting in the same graph three of the most widely used statistical indices: the correlation, the centered root-mean-square error, and the standard deviations of two series (considering the Nature Run as the target). As, in this study, we compare maps of values at the same time, the correlation must be understood as a spatial correlation. These statistics are computed on the surface velocity components at the final time of the assimilation window, which provides the initial condition for the five-day forecast run.
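The three statistics, and the geometric identity that lets a single diagram hold all of them (E′² = σf² + σr² − 2 σf σr R), can be sketched on stand-in fields; the random "analysis" and "Nature Run" maps below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)                       # "Nature Run" map, flattened
mod = 0.8 * ref + 0.3 * rng.standard_normal(1000)     # "analysis" map

R = np.corrcoef(mod, ref)[0, 1]       # spatial correlation
sf, sr = mod.std(), ref.std()         # standard deviations
# centered root-mean-square error (means removed)
E = np.sqrt((((mod - mod.mean()) - (ref - ref.mean())) ** 2).mean())

# Law-of-cosines identity underlying the Taylor diagram
assert np.isclose(E**2, sf**2 + sr**2 - 2 * sf * sr * R)
```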

In the first experiment, we test the assimilation of an increasing number of velocity profiles, randomly located using the criteria described in the previous section. We start with 20 observation instruments (i.e., velocity profilers or ADCPs). As randomness can produce datasets more or less impactful for DA, the test is repeated considering different positions.

For the five-day assimilation window, two points, corresponding to the datasets labeled M and U, are highlighted in Figure 5.


**Figure 5.** Representation of the ensemble of the analyses obtained from the assimilation of different datasets of velocity fields from 20 observation points randomly positioned (blue points), the Free Run (red point), and the Nature Run (green point) in the first assimilation window. The orange circle indicates the M dataset, whereas the cyan circle indicates the U dataset, which Figure 6 refers to.

**Figure 6.** Vector maps of the analysis from the configuration network "M" (**a**) and "U" (**c**) at the end of the assimilation window (see Figure 5). The observation positions, taken from the Nature Run (NR), are represented in (**b**) (Dataset "M") and (**d**) Dataset "U").

We find an excellent capability of the DA algorithm (ROMS-IS4DVar) to adjust the initial condition and bring the evolution of the state of the system closer to the truth.

Figure 5 shows a wide spread among the analyses produced by assimilating different datasets, each corresponding to a different network configuration. The results spread significantly around their average, as the analysis data can have a stronger (around 0.8 for Dataset M) or weaker (around 0.6 for Dataset U) correlation with the NR. Some analyses, although characterized by a high correlation with the NR, have lower standard deviations than the NR; hence, they poorly represent the true circulation structures.

As an example, in Figure 6, we report the snapshots of the circulation of both the worst and the best analysis obtained, respectively, from the synthetic datasets corresponding to the configuration networks "M" and "U."

This first set of experiments is extended through the assimilation of a growing number of randomly positioned ADCPs (40, 60, 80, 100, 150, and 200).

Looking at the averaged statistics of the analyses (labeled AVG in Figure 7), DA has a positive effect in improving the estimation of the model state as the number of observation points increases. At the same time, the statistical indices within each ensemble become increasingly close to each other.


**Figure 7.** Representation of the ensemble of the analyses obtained from datasets of velocity measurements from, respectively, 40 (**a**), 60 (**b**), 80 (**c**), and 150 (**d**) observation points randomly positioned (blue points), Free Run (red point), and Nature Run (green point) in the first assimilation window.

The results of all tests with the same number of observation points are summarized by average points, each representing the averaged statistics of the related ensemble in Figure 8. We observe a progressively better quality of the analyses in terms of all statistical indices (correlation, RMSE, and standard deviation) and a progressively lower benefit from the assimilation of additional observations. Indeed, as the number of observations increases, the marginal improvement of the analyses decreases. Therefore, a suitable strategy to locate measuring instruments is especially impactful in observation networks with few instruments, which is the case for most in situ observation networks used in operational oceanography.

**Figure 8.** Representation of the average point representing a different ensemble of (**a**) the analyses obtained from datasets of velocity measures characterized by a different number of observation points randomly positioned: 20, 40, 60, 80, 150, and 200; (**b**) the subsequent forecast.

Figure 9 shows the improvement in the quality of the analysis and forecast for an increasing number of observations. Forecast reliability, in terms of correlation and error, depends strongly on the quality of the initial conditions and is therefore roughly proportional to the analysis quality.

**Figure 9.** Correlation (**a**) and error (**b**) of the average point representing a different ensemble size (whose members are reported by single dots) of the analyses and the subsequent forecast obtained from datasets of velocity measures characterized by a different number of observation points randomly positioned: 20, 40, 60, 80, 150, and 200.

Considering each analysis individually, some overlap between the statistical scores of different ensembles is also visible in Figures 7 and 9. This means that, in some cases, significantly different numbers of assimilated data can lead to approximately the same improvement. Therefore, a relatively small number of well-positioned instruments can produce an analysis almost equivalent to that produced by a larger network of poorly located instruments. In fact, some analyses obtained with only 20 ADCPs produced results equivalent to those of networks with 40 or even 60 ADCPs (Figures 7 and 9).
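The statistical indices used in these comparisons (correlation, centered RMSE, and standard deviation, as in a Taylor diagram) can be computed as in the following sketch; the function name and array layout are ours, for illustration only:

```python
import numpy as np

def taylor_stats(analysis, truth):
    """Statistics used in a Taylor diagram: correlation with the truth,
    centered RMSE, and the ratio of standard deviations."""
    a = np.asarray(analysis, dtype=float).ravel()
    t = np.asarray(truth, dtype=float).ravel()
    a_anom, t_anom = a - a.mean(), t - t.mean()
    corr = np.dot(a_anom, t_anom) / (np.linalg.norm(a_anom) * np.linalg.norm(t_anom))
    crmse = np.sqrt(np.mean((a_anom - t_anom) ** 2))   # bias removed
    std_ratio = a_anom.std() / t_anom.std()
    return corr, crmse, std_ratio
```

Averaging these indices over the members of an ensemble of analyses gives the single "average point" used to summarize each configuration.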

In the second set of experiments, the positions of in situ observation instruments are identified by two elements: (1) the highest values of the projection on the velocity components of the dominant singular vectors, and (2) a fixed minimum distance among the instruments. As we mentioned in Section 2.2, the need to impose a minimum distance among instruments arises from the typical structure of the dominant singular vectors concentrated in relatively small areas.

In some experiments, not reported in this paper, we sample our DG system by extracting most measurements in the area of highly dominant SVs, but the results are even worse than those obtained with randomly positioned observations.

Distances are provided in dimensionless units, as they are divided by the barotropic Rossby deformation radius LR = (gh)^(1/2)/f ≈ 900 km. We repeat the ensemble of experiments for different minimum distances, from 0.04xLR to 0.28xLR.
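As a quick numerical cross-check of this scaling (the Coriolis parameter f and basin depth h below are assumed values, chosen so that LR reproduces the quoted 900 km; they are not taken from the paper):

```python
import numpy as np

g = 9.81       # gravity (m s^-2)
f = 1.0e-4     # Coriolis parameter (s^-1): assumed mid-latitude value
h = 826.0      # basin depth (m): assumed to match the quoted radius
LR = np.sqrt(g * h) / f   # barotropic Rossby deformation radius (m), ~900 km
min_sep = 0.2 * LR        # a 0.2*LR separation is then about 180 km
```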

Following the present observation strategy, the first observation point is located where the projection of the dominant SV on the velocity is maximum, and the following observation points lie farther than the chosen minimum distance.
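The selection rule just described can be written as a greedy loop; this is a minimal sketch under our own naming conventions, not the paper's implementation:

```python
import numpy as np

def select_obs_points(sv_proj, coords, min_dist, n_obs):
    """Greedy selection of observation points: visit grid points in
    decreasing order of the dominant-SV projection magnitude, and keep a
    point only if it is at least `min_dist` away from every point
    already selected.

    sv_proj : (n,) projection of the dominant SV on the velocity field
    coords  : (n, 2) grid-point coordinates
    """
    order = np.argsort(-np.abs(sv_proj))   # strongest projection first
    selected = []
    for idx in order:
        if len(selected) == n_obs:
            break
        if all(np.linalg.norm(coords[idx] - coords[j]) >= min_dist
               for j in selected):
            selected.append(idx)
    return selected
```

The first selected point is always the maximum of the SV projection; every later point respects the imposed separation.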

The results of this set of experiments are shown in Figures 10 and 11, in which we compare the analyses obtained in the first assimilation window (Figure 10) and, on average, for all the assimilation windows (Figure 11).

**Figure 10.** Representation in the Taylor diagram (**a**) and correlation graph (**b**) of the analyses obtained for the first assimilation window. Each case corresponds to a network configuration defined by the highest SVs and the minimum fixed distance, expressed as a fraction of the Rossby deformation radius. The green point and the green line correspond to the case of a variable distance based on the correlation map.

**Figure 11.** Taylor's diagram (**a**) and correlation graph (**b**) of the average statistics of the analyses obtained for all assimilation windows. Each case corresponds to a network configuration defined by the highest SVs and the minimum fixed distance, expressed as a fraction of the Rossby deformation radius. The green point and the green line correspond to the case of a variable distance on the correlation map.

Looking at the first assimilation window (0−5 days), the worst dataset corresponds to an imposed minimum distance of the order of the spatial resolution, which is 0.04xLR. By increasing this minimum distance, we progressively obtain better correlations, up to a value approaching 0.8 for a minimum separation of about 0.25xLR. The observation positions of both the worst and the best datasets are reported in Figure 12. Considering the other assimilation windows, the results are quite similar: the worst correlations are usually obtained for short minimum distances, and the correlation tends to increase when the observation positions are separated. The maximum correlation is found in the range of 0.15−0.25xLR (Figure 13).

**Figure 12.** Observation points of the datasets selected by the Singular Value Decomposition (SVD)-based procedure imposing a minimum distance of 0.08xLR (**a**) and 0.24xLR (**b**), and by the SVD-based procedure imposing the correlation distance of Figure 10 (**c**).

**Figure 13.** Correlation of the analyses with the Nature Run for the 10 cycles of the 5-day-long assimilation windows, where the observation networks were based on the correlation map of Figure 4 and the SVD computed for different optimization times: 5 days, 10 days, 20 days, and 60 days. The dashed lines represent the results for the random strategy plus a fixed minimum distance. The analyses are compared to each other and to the Free Run (blue line).

Taking the average of all curves referred to the whole set of assimilation windows, the best correlations are obtained by imposing a minimum distance of about 0.2 xLR (Figure 13).

Datasets selected by imposing a distance up to 0.08−0.1xLR produce worse analyses than randomly selected observations; the same occurs with minimum distances larger than 0.3xLR. Conversely, in the range between 0.1xLR and 0.25−0.3xLR, this selection procedure produces, on average, analyses more reliable than those corresponding to random positioning.

Finally, as a third set of experiments, we test a procedure, still based on SVD, in which we impose a maximum correlation between the time series of the observed variables at the observation positions, instead of a fixed minimum distance. Such a limitation on the correlation between the time series of model velocities at observation points is introduced by imposing a minimum distance variable over the domain, and it is computed as explained in Section 2.2.

The selection procedure for this set of experiments is based on the map in Figure 4. Starting from the observation point in which we compute the maximum value of the projection of the SV onto the velocity, we identify the minimum distance to place additional observations through such a map derived from the correlation analysis.
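A map of this kind can be derived, for instance, by finding, for each grid point, the distance at which the modeled velocity time series decorrelates below a threshold. The sketch below is our own illustration; the threshold value is a hypothetical calibration parameter (not tuned in the paper):

```python
import numpy as np

def correlation_distance_map(series, coords, threshold=0.6):
    """For each grid point, the distance to the nearest point whose time
    series correlates with it below `threshold`.

    series : (nt, n) time series at the n grid points
    coords : (n, 2) grid-point coordinates
    """
    anom = series - series.mean(axis=0)
    norm = np.linalg.norm(anom, axis=0)
    corr = (anom.T @ anom) / np.outer(norm, norm)      # (n, n) correlations
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    dmap = np.empty(len(coords))
    for i in range(len(coords)):
        far = dist[i][corr[i] < threshold]             # decorrelated points
        dmap[i] = far.min() if far.size else dist[i].max()
    return dmap
```

Observation points are then placed greedily, as before, but with the exclusion radius at each candidate point read from this map instead of being fixed.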

It is important to underline that, in this procedure, the positions of the observation points are uniquely determined once a correlation threshold is defined. Such a threshold should itself be a calibration parameter of the sampling strategy, to be selected on the basis of the correlation value that guarantees the best agreement between the analysis and the (virtual) truth. Although the correlation threshold is not calibrated in this study, this unique configuration of the observation points is found to be the one that gives rise, on average, to the best analyses.

The average value of such a variable correlation distance is around 300 km, which is about 0.3xLR, slightly outside the "best" minimum distance found for the second set of experiments in which a fixed separation among observations is set. However, this distance, on the map in Figure 4, varies between lower values (around 150−200 km), near the convergence area, and higher values (>600 km) near the northern, southern, and eastern edges. The highest SVs are located within the area of lower correlation distances that fall within the range of 0.15−0.20 LR that we find, empirically, as an optimal distance interval using the fixed distance criterion.

The third procedure improves, on average, the quality of the analysis with respect to both the random procedure and the SVD-based procedure with a fixed distance among observations.

The observation strategy is tested for repeated assimilation cycles, as in the normal procedures adopted in the operational practice. A summary representation of results is shown in Figure 13. The observation strategy based on the combination of SVD and correlation analysis (solid lines) in most cases gives the best performances with respect to any random positioning procedure (dashed lines).

We also assess the sensitivity of the results to the optimization time for the SVD computation. In this case, the difference among the datasets, in terms of correlation with respect to the NR, is not so relevant, but we find that the configuration of observation points obtained by the SVD on a shorter optimization period (5−10 days) gives the best results. In particular, the analysis corresponding to the SVD with an optimization time of five days (SV5 in Figure 13) always yields significantly better analyses than any other tested strategy.

#### **4. Discussion and Conclusions**

An improvement in the reliability of an ocean forecasting system can be obtained through a proper design of its ocean-observing component.

Singular Value Decomposition has already been used by various authors in the field of Geophysical Fluid Dynamics (mainly for the atmosphere) for several purposes, including the identification of possible adaptive observation strategies. However, analyses of the potential of this method to provide effective and functional indications for the design of in situ observation networks are still rather scarce.

In this work, we evaluate some possible SVD-based strategies to determine an optimal set of in situ observation points in the case where only a limited number of observation instruments is available. This situation is common in practice, given the high cost of installing and managing in situ observation networks, especially in the oceanographic field.

We compare three observation strategies aimed at reducing the forecast uncertainty obtained through an idealized Double-Gyre ocean model, with repeated analysis and forecast cycles, using the variational assimilation scheme ROMS-IS4DVar.

We first proceed to evaluate the benefit linked to the assimilation of randomly positioned observations. The assimilation algorithm in use always produces a positive improvement in the estimation of the state of the system. The effectiveness of this improvement is not straightforward, as it depends in a complex way on the number of observation points and also on the location of these points in the model domain. This is especially evident in case only a limited number of observations is available.

Having a limited number of observation instruments, and looking for the combination of positions that gives the maximum benefit to DA, we assume that a fundamental indication for selecting observation points can be provided by the study of the areas in which the maximum error growth occurs. SVD is an excellent method for identifying these areas. The computation of the dominant Singular Vectors (SVs), and in particular of their projection on the physical components of interest, i.e., the velocity field, can give important information about error dynamics within the limit of validity of the tangent linear model. However, as the highest values of such SV components can be concentrated in small areas, information obtained from points too close to each other is likely to be too correlated. To avoid this effect, we test two criteria:

1. a fixed minimum distance imposed among the observation positions;
2. a variable minimum distance derived from a correlation analysis, i.e., a maximum correlation allowed between the time series of the observed variables at the observation points.

Further improvements of the last criterion can be achieved through an accurate calibration of the threshold correlation parameter. However, even when adopting a first-guess threshold parameter, the results obtained are, on average, better than with any of the other strategies tested.

The extension of this method to real applications must take into account other factors, such as the presence of other variables of interest or a more accurate characterization of the observation error. In cases of ocean models more complex than the ocean DG, when baroclinic effects and density variations are more important, an SVD-based observation strategy should also evaluate the projection of the dominant SV on other variables, such as temperature and salinity, as any observation strategy cannot disregard the acquisition of density profiles. The application of this method to real ocean systems will also require a careful characterization of measurement errors, estimated from the performances of real observation instruments.

Applying such criteria to the design of observation networks, as in the standard Observing System Simulation Experiments (OSSEs) used to simulate the possible benefits of observing systems, could be of great interest. Indeed, most existing ocean observation networks were not designed from the very beginning using objective criteria to optimally support analysis/forecast models. Suitable design strategies are therefore needed both for setting up new observation systems and for expanding the capabilities of existing networks, in order to improve their efficiency for data assimilation.

**Author Contributions:** Conceptualization, M.F. and C.B.; data curation, M.F.; formal analysis, M.F.; investigation, M.F.; methodology, M.F. and C.B.; supervision, C.B.; writing—original draft, M.F.; writing—review and editing, C.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors wish to thank their colleagues Michele Bendoni and Matteo Vannini for their support in revising the manuscript, as well as the reviewers for useful comments and suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **A Multivariate Balanced Initial Ensemble Generation Approach for an Atmospheric General Circulation Model**

**Juan Du \*, Fei Zheng \*, He Zhang and Jiang Zhu**

International Center for Climate and Environment Science, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China; zhanghe@mail.iap.ac.cn (H.Z.); jzhu@mail.iap.ac.cn (J.Z.)

**\*** Correspondence: dujuan10@mail.iap.ac.cn (J.D.); zhengfei@mail.iap.ac.cn (F.Z.)

**Abstract:** Based on the multivariate empirical orthogonal function (MEOF) method, a multivariate balanced initial ensemble generation method was applied to the ensemble data assimilation scheme. The initial ensembles were generated with a reasonable consideration of the physical relationships between different model variables. The spatial distribution derived from the MEOF analysis is combined with the 3-D random perturbation to generate a balanced initial perturbation field. The Local Ensemble Transform Kalman Filter (LETKF) data assimilation scheme was established for an atmospheric general circulation model. Ensemble data assimilation experiments using different initial ensemble generation methods, spatially random and MEOF-based balanced, are performed using realistic atmospheric observations. It is shown that the ensembles integrated from the balanced initial ensembles maintain a much more reasonable spread and a more reliable horizontal correlation compared with the historical model results than those from the randomly perturbed initial ensembles. The model predictions were also improved by adopting the MEOF-based balanced initial ensembles.

**Keywords:** MEOF; initial ensemble; ensemble spread; LETKF; data assimilation

**Citation:** Du, J.; Zheng, F.; Zhang, H.; Zhu, J. A Multivariate Balanced Initial Ensemble Generation Approach for an Atmospheric General Circulation Model. *Water* **2021**, *13*, 122. https:// doi.org/10.3390/w13020122

Received: 30 October 2020 Accepted: 29 December 2020 Published: 7 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

The ensemble Kalman filter (EnKF) data assimilation approach, introduced by Evensen in 1994 [1], is a Monte-Carlo approach with the potential for efficient use on parallel computers with large-scale geophysical models [2–8]. The EnKF method uses an ensemble of model forecasts to estimate the background error covariances and optimizes the background with the available observations. It is therefore easy to implement (no adjoint model is required, in contrast with three-dimensional variational data assimilation [2,9]) and handles strong non-linearities better than other known Kalman filter techniques for large-scale problems [10].

EnKF was first applied to an atmospheric model by Houtekamer and Mitchell [11]. Since then, it has rapidly become a promising choice for operational numerical weather prediction systems. The square root filter (SRF) variant of EnKF without perturbed observations (deterministic filters) was proposed to assimilate observations serially [12,13], and the EnKF method with perturbed observations (stochastic filters) was then applied to a pre-operational system [14]. A local ensemble Kalman filter (LEKF) method that assimilates observations simultaneously was proposed by Ott et al. [15]. Furthermore, the local ensemble transform Kalman filter (LETKF), which uses the ensemble transform Kalman filter (ETKF) approach, was proposed to further accelerate the LEKF [16,17]. The LETKF assimilates observations within a spatially local volume at each model grid point simultaneously and does not require an orthogonal basis, which significantly enhances the computational efficiency under parallel implementation [17,18].

For the initial perturbation generation, several kinds of methods have been developed, such as error breeding [19], singular vectors [20], perturbed observations [21], and random perturbations [5,21–23]. The performances of these methods have been illustrated in several numerical weather prediction models of different complexities [24–27]. Zheng and Zhu proposed a multivariate empirical orthogonal function (MEOF)-based model error perturbation method to generate perturbed model errors and applied it to a global spectral atmospheric model with real observations [28,29]. How to maintain the physical relationships between the different model variables induced by the initial perturbations, and how to provide a reasonable background covariance, remain important problems for the ensemble data assimilation process.

In this work, the local ensemble transform Kalman filter approach has been implemented for an atmospheric general circulation model developed by the Institute of Atmospheric Physics (IAP AGCM version 4). An MEOF-based balanced perturbation generation method is adopted for generating the initial ensembles and compared with the spatially random perturbation method [5]. The remainder of this paper is structured as follows: in Sections 2 and 3, the forecast model and the LETKF data assimilation scheme are briefly described, respectively. In Section 4, the implementation of the initial perturbation generation scheme based on the multivariate empirical orthogonal function (MEOF) is introduced. In Section 5, the spatially-correlated random perturbation scheme and the MEOF-based balanced perturbation scheme are both applied to the AGCM model results to generate the initial ensembles, and the ensemble spread and horizontal correlation of the initial ensembles are compared for the two methods. The LETKF data assimilation scheme is then applied to the AGCM with conventional observation data, and the characteristics and effects of the random and MEOF-based initial ensemble generation methods are illustrated. The data assimilation results using the two different initial ensembles are also shown in this section. Summary and conclusions are drawn in the final section.

#### **2. The Forecast Model**

The atmospheric general circulation model used here is the IAP AGCM version 4, a component of the Chinese Academy of Sciences (CAS) earth system model (ESM). The model has been applied to the simulation of atmospheric circulations and climate, such as summer precipitation and monsoons [30,31]. It is a global grid-point model using finite-difference schemes with a terrain-following *σ* coordinate. A latitude-longitude grid with Arakawa's C grid staggering is used in the horizontal discretization [32–34]. The formulation of the governing equations and the finite-difference schemes have several novel features in the IAP AGCM. The model equations are based on the baroclinic primitive equations with subtraction of the standard stratification; the purpose of this subtraction in the dynamical core is to reduce truncation errors, especially over regions of high terrain. The IAP model conserves total available energy (the sum of the kinetic energy, the available potential energy, and the available surface potential energy) rather than total energy. To maintain the conservation of the total available energy, a variable substitution method named the IAP transform is adopted in the numerical design. The model resolution adopted here is 1 degree by 1 degree. The nonlinear iterative time integration method described in [35] is used in the model, with a timestep of 1200 s. The prognostic model variables are temperature, surface pressure, wind velocity, and specific humidity.

#### **3. Data Assimilation Scheme**

The Local Ensemble Transform Kalman Filter (LETKF) algorithms used here are based on the work of Hunt et al. [17]. An important advantage of LETKF schemes over the EnKF is their efficiency in parallel computing: because the LETKF separates the entire global grid into independent local regions, it can ideally achieve full parallel efficiency [18]. In this section, we briefly introduce the main idea of the LETKF.

The ensemble members are defined as $x_i \in \mathfrak{R}^n$ ($i = 1, \cdots, N$), where N is the ensemble size and n is the dimension of the model state. The ensemble matrix X is constructed from the model states of the ensemble as:

$$X = (\mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_N) \in \mathfrak{R}^{n \times N} \tag{1}$$

The anomaly matrix is

$$X' = X - \overline{X} \tag{2}$$

where $\overline{X}$ is the matrix whose columns all equal the ensemble mean $\overline{x}$.

To update analysis states at every grid point, the LETKF assimilates only observations within a certain distance from each grid point. Here we use the subscript *l* to denote a quantity defined on such a local region centered at an analysis grid point. The analysis mean is

$$\overline{x}_l^a = \overline{x}_l + X_l' P_l^a (Y_l')^T R_l^{-1} \left(y_l^o - \overline{y}_l\right) \tag{3}$$

and the analysis error covariance matrix $P_l^a$ is

$$P_l^a = \left[ (Y_l')^T R_l^{-1} Y_l' + (N - 1) I / \rho \right]^{-1} \tag{4}$$

where $R_l$ is the observation error covariance matrix. The observation vector is $y^o \in \mathfrak{R}^m$, and $Y_l' = (y_1 - \overline{y}, \cdots, y_N - \overline{y})$ is the observation-space anomaly matrix, with $y_i = H(x_i)$, where H is the observation operator which interpolates the model state to the observation space. $\rho$ is the multiplicative inflation factor. Within a local region, spatial localization is carried out by multiplying the inverse observation error covariance matrix by a factor that decays from one to zero as the distance of the observations from the analysis grid point increases [36].
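The local analysis mean of Equations (3) and (4) amounts to a few lines of linear algebra. The sketch below is our own illustration, assuming the observation-space ensemble $H(x_i)$ has already been computed; it omits localization weighting and the update of the anomalies:

```python
import numpy as np

def letkf_analysis_mean(x_ens, y_ens, y_obs, R, rho=1.0):
    """Local LETKF analysis mean, Eqs. (3)-(4).

    x_ens : (n, N) state ensemble in the local region
    y_ens : (m, N) observation-space ensemble H(x_i)
    y_obs : (m,)   local observations
    R     : (m, m) observation error covariance
    rho   : multiplicative inflation factor
    """
    N = x_ens.shape[1]
    x_mean = x_ens.mean(axis=1)
    y_mean = y_ens.mean(axis=1)
    Xp = x_ens - x_mean[:, None]          # state anomalies X'
    Yp = y_ens - y_mean[:, None]          # obs-space anomalies Y'
    Rinv = np.linalg.inv(R)
    # Eq. (4): analysis error covariance in ensemble space
    Pa = np.linalg.inv(Yp.T @ Rinv @ Yp + (N - 1) * np.eye(N) / rho)
    # Eq. (3): analysis mean
    return x_mean + Xp @ Pa @ Yp.T @ Rinv @ (y_obs - y_mean)
```

For a scalar state observed directly, this reduces to the familiar Kalman update, with gain equal to the prior ensemble variance divided by the sum of prior and observation variances.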

#### **4. Multivariable Balanced Initial Perturbation Scheme**

Based on the multivariate empirical orthogonal function (MEOF) method [28], an MEOF-based multivariable initial perturbation method is adopted here to generate a balanced initial ensemble state for the LETKF data assimilation, which helps the ensemble members maintain a reasonable spread as the forecast model integrates. For the MEOF analysis, the snapshots of all the model variables are put into one single vector, instead of performing the EOF analysis for each model variable individually. The spatial distribution of the model snapshots derived from the MEOF analysis and the 3-D random perturbation are combined to generate a balanced perturbation field. The detailed implementation steps of the method are described as follows:

$$Q_i(x, y, z, v) = D(x, y, z, v) + \sum_{j=1}^{N_m} \sigma_j(z, v)\, \phi_j(x, y, z, v)\, \omega_{i,j}, \qquad i = 1, \dots, N \tag{5}$$

where $Q_i(x, y, z, v)$ represents the generated initial perturbation field for the i-th ensemble member, and $D(x, y, z, v)$ represents the initial model state. $\sigma_j(z, v)$ represents the standard deviation of the model variables in the different model layers, which can be calculated from the time coefficients of the MEOF analysis. $N_m$ is the number of modes retained from the MEOF analysis. $\phi_j(x, y, z, v)$ is the analyzed spatial MEOF mode of the model state variables in the different layers. $\omega_{i,j}$ is a one-dimensional random variable with mean 0 and variance 1, and the random vectors $\omega_{i,j}$ ($j = 1, \dots, N_m$) should be independent to keep the MEOF modes orthogonal. x, y, and z represent the 3-D coordinates, v represents the different model variables, and N is the ensemble size.

In practice, we first derive the departures of the model integration results from their average in each model layer to generate the balanced initial perturbation fields. The standard deviations $\sigma(z, v)$ of the model variables in each model layer are calculated to normalize the model variables in all the model layers. The MEOF analysis is then performed for the normalized model variables, and the spatial modes $\phi(x, y, z, v)$ are obtained. Finally, we apply the above equation to generate the initial ensemble perturbation fields. Because the perturbations are a combination of the spatial distributions of all the model variables, the initial ensembles are generated with a reasonable consideration of the physical relationships between the different model variables. We then add the derived MEOF-based perturbations to the initial state of the model, i.e., to the prognostic variables (ps, U, V, T, q). After the initial ensembles are generated, we integrate the model for six hours and use the six-hour model forecast as the analysis samples, because it is crucial to check whether the ensemble spread and the spatial correlation at the first analysis time remain reasonable.
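Equation (5) and the steps above can be sketched as follows; the function name and the flattened array layout (all variables and layers stacked into one state vector) are our own assumptions for illustration:

```python
import numpy as np

def meof_perturbations(D, phi, sigma, N, rng=None):
    """Generate N balanced initial perturbation fields from Eq. (5).

    D     : (n,)     initial model state (all variables in one vector)
    phi   : (n, Nm)  spatial MEOF modes
    sigma : (Nm,)    standard deviations from the MEOF time coefficients
    N     : ensemble size
    """
    rng = np.random.default_rng() if rng is None else rng
    Nm = phi.shape[1]
    # independent random coefficients omega_{i,j}: mean 0, variance 1
    omega = rng.standard_normal((N, Nm))
    # Q_i = D + sum_j sigma_j * phi_j * omega_{i,j}, vectorized over i
    return D[None, :] + (omega * sigma) @ phi.T
```

Because each perturbation is a linear combination of joint multivariate modes, the physical relationships between the variables encoded in the MEOF analysis are preserved in every member.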

#### **5. Data Assimilation Experiments**

Two different initial ensemble generation methods are tested for the LETKF data assimilation of the AGCM: the spatially-correlated random perturbation scheme [5] and the MEOF-based balanced perturbation scheme. For both initial perturbation schemes, 80 ensemble members are adopted for the ensemble data assimilation process. The observational data adopted here are the global upper-air and surface weather observation data in PREPBUFR format, which are commonly used as conventional observation data for data assimilation systems. The data include land surface, marine surface, radiosonde, pibal and aircraft reports, profiler and radar-derived winds, satellite wind data, and so on. The observed quantities include pressure, geopotential height, temperature, dew point temperature, and wind direction and speed. The conventional observations are grouped into 6-hour time windows centered on the analysis times (0000, 0600, 1200, and 1800 UTC) and are assimilated into the model every 6 hours from 1 January to 10 January 2004. An example of the conventional observation data for surface temperature is shown in Figure 1.

In the model integration process, the temperature variable influences the integration of the other model variables over time. Therefore, for the generation of the randomly perturbed initial ensemble, we simply add 3-D random noise of a certain magnitude (1% of the magnitude of T) to the temperature variable of the atmospheric general circulation model at all layers, following Evensen's idea [5]. The random perturbation is generated with a horizontal correlation scale of 2000 km and a vertical correlation scale of 1000 km, as well as a correlation of 0.8 between two adjacent layers. For the generation of the MEOF-based perturbed initial ensemble, we implement the multivariable balanced initial perturbation scheme described in Section 4. The spatial distribution of the model snapshots derived from the MEOF analysis and the 3-D random perturbation are combined to generate a balanced perturbation field.
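Spatially correlated noise of this kind can be obtained, for example, by smoothing white noise with a Gaussian kernel in spectral space. This is a sketch of the general idea only, not the exact scheme of [5]; the correlation scale is given in grid points as a stand-in for the 2000 km / 1000 km physical scales:

```python
import numpy as np

def correlated_noise(shape, corr_scale_pts, amplitude, rng=None):
    """2-D random field with a prescribed horizontal correlation scale:
    white noise damped by a Gaussian in Fourier space (periodic domain),
    then rescaled to the requested standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(shape)
    kx = np.fft.fftfreq(shape[0])[:, None]       # cycles per grid point
    ky = np.fft.fftfreq(shape[1])[None, :]
    # Fourier transform of a Gaussian of width corr_scale_pts
    damp = np.exp(-2.0 * (np.pi * corr_scale_pts) ** 2 * (kx**2 + ky**2))
    field = np.fft.ifft2(np.fft.fft2(noise) * damp).real
    return amplitude * field / field.std()
```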

**Figure 1.** Conventional observation data of temperature at the surface layer at 06UTC 20040101.

#### *5.1. The MEOF Analysis Results*

For the MEOF analysis and the MEOF-based perturbation generation, the AGCM is integrated from 1 January to 31 March 2004 to generate the six-hour model forecast outputs. A total of 360 snapshots are adopted for the MEOF analysis. In contrast to an EOF analysis performed for each individual model variable, the MEOF analysis combines all the model variables in one vector. Figure 2 shows the variance contributions of the first 24 modes from the MEOF analysis of the surface pressure. The total variance contribution of the first 16 MEOF modes already exceeds 99%, so the first 20 MEOF modes are adopted to generate the balanced perturbation fields. The spatial distributions and the time coefficients of the first three MEOF modes of the surface pressure (Ps) are shown in Figure 3. Similarly, the detailed MEOF analysis results for the temperature (T) at the surface layer are shown in Figures 4 and 5; the total variance contribution of the first twenty MEOF modes also exceeds 99%.

**Figure 2.** Variance contributions of the first 24 modes from the MEOF analysis of Ps.

**Figure 3.** The spatial distribution (**a**) and the time coefficients (**b**) of the first three modes of the MEOF analysis of Ps.

**Figure 4.** Variance contributions of the first 24 modes from the MEOF analysis of T.

**Figure 5.** The spatial distribution (**a**) and the time coefficients (**b**) of the first three modes of the MEOF analysis of T.

#### *5.2. Ensemble Spread*

To verify the quality of the generated initial ensembles, it is essential to compare the ensemble spread of the initial ensemble with that of the model outputs after the six-hour integration, i.e., at 06UTC 1 January 2004. A reasonable ensemble spread should represent well the distribution of the forecast uncertainties before the assimilation takes place, and a larger ensemble spread can result in a Kalman gain that reasonably draws the analysis closer to the observations [28]. For both the random perturbation and the MEOF balanced perturbation schemes, the ensemble spread of T decreases after the six-hour integration compared with the initial ensemble spread, as shown in Figures 6 and 7. The difference is that the ensemble spread of the MEOF balanced perturbed ensemble decreases much less than that of the randomly perturbed ensemble. The averaged spread of the randomly perturbed initial ensemble of the temperature at the surface layer is about 3.3 degrees, which decreases to about 1.4 degrees after the six-hour integration. In contrast, the averaged ensemble spread of the MEOF-based balanced initial perturbation of the temperature at the surface layer is about 7.2 degrees, which decreases to about 6.1 degrees after the six-hour integration. This shows that the MEOF balanced perturbation maintains a more reasonable ensemble spread, which is very important for the data assimilation process.

**Figure 6.** The initial (**a**) and the 6-h integration (**b**) ensemble spread of the randomly perturbed T.

**Figure 7.** The initial (**a**) and the 6-h integration (**b**) ensemble spread of the MEOF perturbed T.

#### *5.3. Horizontal Correlation*

Taking the surface pressure as an example, we calculated the horizontal correlations at four locations for both the randomly perturbed initial ensemble and the MEOF-based balanced initial ensemble. The four locations are (67.5 E, 33.31 N), (90 E, 68.74 S), (178.59 E, 0.71 S) and (61.87 W, 55.98 N). Figure 8 shows the historical horizontal correlations of the surface pressure at the four chosen locations; the historical results comprise the six-hour model integration outputs from 1 August to 31 October 2004. Figure 9 shows the horizontal correlations of the randomly perturbed ensemble of the surface pressure at the four locations, calculated from the six-hour forecast of the randomly perturbed initial ensemble. Figure 10 shows the horizontal correlations of the MEOF-based perturbed ensemble of the surface pressure at the four locations, calculated from the six-hour forecast of the MEOF-based perturbed initial ensemble. The horizontal correlations of the MEOF-based perturbed ensemble are much closer to the historical horizontal correlations of the model integration than those of the randomly perturbed ensemble. The correlations of the randomly perturbed ensemble have the usual oval shape and cannot represent the historical characteristics in the middle- and high-latitude areas. This shows that the ensembles generated from the MEOF perturbations represent the historical horizontal correlations better. Similar conclusions can be drawn for the other state variables, such as the temperature, the wind velocity and the humidity.
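The point-wise horizontal correlation maps of Figures 8-10 can be estimated, in sketch form, by correlating the field value at a chosen grid point with every other grid point across the sample (or member) dimension. The grid shape and indices below are hypothetical:

```python
import numpy as np

def point_correlation(ensemble, j, i):
    """Correlation of the field at grid point (j, i) with every other
    grid point, estimated across the sample/member dimension.

    ensemble : array shaped (n_samples, ny, nx).
    Returns a (ny, nx) correlation map.
    """
    n, ny, nx = ensemble.shape
    X = ensemble.reshape(n, -1)
    X = X - X.mean(axis=0)                     # remove the sample mean
    ref = X[:, j * nx + i]                     # reference-point series
    cov = X.T @ ref / (n - 1)                  # covariance with all points
    corr = cov / (X.std(axis=0, ddof=1) * ref.std(ddof=1))
    return corr.reshape(ny, nx)

# Sanity check: the reference point correlates perfectly with itself
rng = np.random.default_rng(2)
ens = rng.standard_normal((80, 16, 32))
cmap = point_correlation(ens, 8, 10)
print(round(cmap[8, 10], 6))
```

Applied to the historical six-hour outputs this gives maps like Figure 8, and applied to the two forecast ensembles it gives maps like Figures 9 and 10.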

**Figure 8.** The historical horizontal correlation of the model integration results at four locations.

**Figure 9.** The horizontal correlation of the randomly perturbed ensemble at four locations.

**Figure 10.** The horizontal correlation of the MEOF based perturbed ensemble at four locations.

#### *5.4. LETKF Data Assimilation Results*

The LETKF data assimilation scheme is applied to the atmospheric general circulation model using 80 ensemble members. For the initial ensemble generation, the spatially correlated random perturbation scheme and the MEOF-based balanced perturbation scheme are implemented and compared in several respects, such as the ensemble spread and the horizontal correlation. The initial ensemble generated from the MEOF-based balanced perturbation performs better: the ensemble forecast from the MEOF-based perturbed initial ensemble maintains a better spread, and its horizontal correlation is more compatible with that of the historical model output. We therefore adopted the MEOF-based perturbed initial ensemble to start the data assimilation process. The observations used here are the six-hour conventional observation data starting from 06 UTC 1 January 2004. The observation data of the temperature, the meridional wind and the zonal wind were assimilated into the AGCM. Figure 11 shows the root mean square error (RMSE) of the LETKF data assimilation results for the surface temperature over the first six assimilation times, computed against the conventional observation data (see Figure 1). The RMSE of the data assimilation results derived from both the randomly and the MEOF-based perturbed initial ensembles is smaller than the RMSE of the control model, which means that the initial ensembles generated by both methods work during the data assimilation process. Moreover, the RMSE of the results derived from the MEOF-based perturbed initial ensemble is smaller than that derived from the randomly perturbed initial ensemble. Because the initial ensemble generated by the MEOF-based perturbation has better physical relationships between the model variables, the data assimilation effect is further improved.
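For orientation, a schematic stochastic ensemble Kalman update is sketched below. It is a simplified stand-in for the full LETKF (no localization, no transform formulation) with hypothetical dimensions and a linear observation operator, intended only to show how an ensemble and observations combine through the Kalman gain:

```python
import numpy as np

def enkf_update(ens, obs, H, r_var, rng):
    """Schematic stochastic (perturbed-observations) ensemble Kalman
    update; a simplified stand-in for the full LETKF scheme.

    ens   : (n_members, n_state) forecast ensemble
    obs   : (n_obs,) observation vector
    H     : (n_obs, n_state) linear observation operator
    r_var : observation error variance (scalar)
    """
    n = ens.shape[0]
    Xf = ens - ens.mean(axis=0)                # state perturbations
    Yf = Xf @ H.T                              # obs-space perturbations
    Pyy = Yf.T @ Yf / (n - 1) + r_var * np.eye(len(obs))
    Pxy = Xf.T @ Yf / (n - 1)
    K = Pxy @ np.linalg.inv(Pyy)               # Kalman gain
    pert_obs = obs + np.sqrt(r_var) * rng.standard_normal((n, len(obs)))
    return ens + (pert_obs - ens @ H.T) @ K.T  # member-wise analysis

rng = np.random.default_rng(4)
ens = 2.0 * rng.standard_normal((80, 10))      # wide prior, 80 members
H = np.eye(10)                                 # observe every state entry
obs = 0.1 * rng.standard_normal(10)            # accurate observations
ana = enkf_update(ens, obs, H, 0.01, rng)
# with accurate observations, the analysis spread collapses toward R
print(ana.std(axis=0).mean() < ens.std(axis=0).mean())
```

The actual LETKF additionally localizes the update around each grid point and computes the analysis through an ensemble transform, which this sketch omits.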
The LETKF data assimilation also improved the meridional and zonal wind results compared with the observations (figures not shown).
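The RMSE verification against observations amounts to the usual root mean square of analysis-minus-observation differences. A minimal sketch with synthetic data (the halfway "assimilation" below is purely illustrative, not the LETKF):

```python
import numpy as np

def rmse(analysis, obs):
    """Root mean square error of a model/analysis field against
    observations at the same locations."""
    d = np.asarray(analysis) - np.asarray(obs)
    return float(np.sqrt(np.mean(d**2)))

# Hypothetical check: an analysis nudged halfway toward the
# observations has a smaller RMSE than the free-running control.
rng = np.random.default_rng(3)
obs = rng.standard_normal(1000)
control = obs + rng.standard_normal(1000)      # control model error
analysis = 0.5 * (control + obs)               # crude "assimilation"
print(rmse(control, obs), rmse(analysis, obs))
```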

**Figure 11.** The RMSE of the LETKF data assimilation results compared with the observation data for the surface temperature.

#### **6. Conclusions**

Based on the multivariate empirical orthogonal function (MEOF) method, a multivariate balanced initial ensemble generation method was applied to an ensemble data assimilation scheme. The initial ensembles were generated with a reasonable consideration of the physical relationships between different model variables. For the initial ensemble generation, the spatially correlated random perturbation scheme and the MEOF-based balanced perturbation scheme were implemented and compared in several respects, such as the ensemble spread and the horizontal correlation. From these analyses, the initial perturbations generated by the MEOF method are much more effective, since they do not decay rapidly as the model integrates. The ensembles integrated from the MEOF-based perturbed initial fields maintain a much more reasonable spread and a more reliable horizontal correlation than those from the randomly perturbed initial fields. The Local Ensemble Transform Kalman Filter (LETKF) data assimilation scheme was established for an atmospheric general circulation model, and ensemble data assimilation experiments using the two initial ensemble generation methods, spatially random and MEOF-based balanced, were performed using realistic atmospheric observations. The model predictions were also improved by adopting the MEOF-based balanced multivariate initial ensembles. At present, only conventional observation data are assimilated into the AGCM; more data assimilation experiments with the LETKF scheme using satellite observation data will be performed in future research.

**Author Contributions:** Conceptualization, J.D.; methodology, J.D. and F.Z.; software, J.D. and H.Z.; validation, J.D.; formal analysis, J.D.; writing—original draft preparation, J.D.; writing—review and editing, J.D., F.Z. and J.Z.; visualization, J.D.; supervision, J.Z.; funding acquisition, J.D., F.Z., H.Z. and J.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Key Research and Development Program of China (Grant No. 2018YFA0605703), the National Natural Science Foundation of China (Grant No. 41776041; 41630530), the Key Research Program of Frontier Sciences, CAS (Grant No. ZDBS-LY-DQC010), the National Natural Science Foundation of China (Grant No. 41876012; 41861144015) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB42000000).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**



*Water* Editorial Office E-mail: water@mdpi.com www.mdpi.com/journal/water
