#### *3.1. General Euler-Lagrange Equations*

Among the many applications of differential calculus are problems of finding the maxima and minima of functions. Analogously, the calculus of variations operates on *functionals*, which are mappings from a space of functions to its underlying field of scalars. Considering the functional *L*[*y*], the typical formulation of a calculus of variations problem is

$$\begin{aligned} \min \quad & L[y] = \int\_{x\_0}^{x\_1} F(x, y, \dot{y})\, dx \\ & y(x\_0) = y\_0 \qquad & y(x\_1) = y\_1 \end{aligned} \tag{1}$$

where the initial and terminal values are defined as (*x*<sub>0</sub>, *y*<sub>0</sub>) and (*x*<sub>1</sub>, *y*<sub>1</sub>), respectively, and *y*˙ denotes the derivative of *y* with respect to *x*. In general, *y* can be a vector of functions dependent on *x*, itself a vector of independent variables. The theory behind finding the extrema of such problems is analogous to single-variable calculus, in which we use a vanishing first derivative to locate critical points. Here, we locate the extremal functions using functional derivatives, leading to the Euler-Lagrange equations outlined in Section 3.2.

Though the formulation of finding the shortest path as a calculus of variations problem is rather elementary, its solution is involved. In Euclidean geometry, this path is a straight line, and its length is easily found. However, moving these ideas onto statistical manifolds complicates both the geometry and the calculus of this seemingly elementary problem. Analogous to the Euclidean setting, in a Riemannian manifold such as our space of Gaussians, solving for the shortest path *L* involves the summation of infinitesimal arc lengths *ds*, where

$$ds^2 = \dot{\theta}^T g(\theta) \dot{\theta},\tag{2}$$

where *θ* is a parameter vector, *g*(*θ*) is a metric tensor dependent on the parameter vector, and (·)<sup>*T*</sup> denotes the transpose of a vector. The metric tensor for Euclidean space is the identity matrix, but on the multivariate Gaussian manifold, this metric tensor is the Fisher information matrix, discussed in Section 3.2.
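As a minimal numeric illustration of Equation (2) (our own sketch, not part of the original development), the snippet below evaluates the squared line element for a univariate Gaussian parameterized by *θ* = (*μ*, *σ*), whose Fisher metric is the well-known diag(1/*σ*², 2/*σ*²). Unlike the Euclidean case, the same parameter velocity produces a different length depending on where on the manifold it occurs.

```python
import numpy as np

def ds2(theta_dot, g):
    """Squared line element of Equation (2): ds^2 = theta_dot^T g theta_dot."""
    return theta_dot @ g @ theta_dot

theta_dot = np.array([1.0, 0.0])  # unit velocity in the mu direction

# Euclidean metric: the identity, the same everywhere on the space.
print(ds2(theta_dot, np.eye(2)))  # 1.0

# Fisher metric of a univariate Gaussian in (mu, sigma): diag(1/s^2, 2/s^2).
for sigma in (0.5, 1.0, 2.0):
    g = np.diag([1.0 / sigma**2, 2.0 / sigma**2])
    print(sigma, ds2(theta_dot, g))  # the same step is "longer" when sigma is small
```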

This makes the functional we wish to minimize

$$P = \int\_{x\_0}^{x\_1} \sqrt{\dot{\theta}^T g(\theta) \dot{\theta}}\, dx \tag{3}$$

or, because the square root is a monotonically increasing function, we can conveniently minimize instead the energy functional (the two functionals share their minimizing paths)

$$\mathcal{F} = \int\_{x\_0}^{x\_1} \dot{\theta}^T g(\theta) \dot{\theta}\, dx. \tag{4}$$

With this, the calculus of variation problem that solves for the minimum distance on a manifold is

$$\begin{aligned} \min \quad & \mathcal{F}[\theta] = \int\_{x\_0}^{x\_1} \dot{\theta}^T g(\theta) \dot{\theta}\, dx \\ & \theta(x\_0) = [\theta\_{01}, \theta\_{02}, \dots, \theta\_{0n}] \qquad & \theta(x\_1) = [\theta\_{11}, \theta\_{12}, \dots, \theta\_{1n}] \end{aligned} \tag{5}$$

In the present context, the Fisher information metric tensor is *g*(*θ*) = *g*(*μ*, Σ), expressed in the natural parameterization for multivariate Gaussians. Moreover, *θ*(*x*<sub>0</sub>) and *θ*(*x*<sub>1</sub>) are the parameters of the initial and final distributions.

The minimizer of Equation (5) satisfies the well-known Euler-Lagrange equations, a system of second-order differential equations. These equations operate on the integrand of Equation (4). Accordingly, we define

$$K = \dot{\theta}^T g(\theta) \dot{\theta}. \tag{6}$$

With this, the Euler-Lagrange equations are

$$K\_{\theta} - \frac{d}{dx} K\_{\dot{\theta}} = 0, \tag{7}$$

where *K*<sub>*θ*</sub> and *K*<sub>*θ*˙</sub> denote the partial derivatives of *K* with respect to *θ* and *θ*˙, respectively, evaluated along the curve *θ*(*x*).
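As a sanity check of Equation (7), the following SymPy sketch (our own, with symbol names chosen for illustration) derives the Euler-Lagrange equation for the Euclidean arc length integrand *F* = √(1 + *y*˙²) and recovers the expected straight-line extremals.

```python
import sympy as sp

x = sp.Symbol('x')
y = sp.Function('y')

# Euclidean arc length integrand: F(x, y, y') = sqrt(1 + y'^2)
F = sp.sqrt(1 + y(x).diff(x)**2)

# Euler-Lagrange residual of Equation (7): F_y - d/dx F_{y'}
F_y = F.diff(y(x))
F_ydot = F.diff(y(x).diff(x))
residual = F_y - F_ydot.diff(x)

# The numerator of the residual is -y'', so the extremals satisfy y'' = 0:
# straight lines, the expected Euclidean geodesics.
ode = sp.numer(sp.together(residual))
print(sp.dsolve(sp.Eq(ode, 0), y(x)))  # y(x) = C1 + C2*x
```

The same mechanical recipe applies to the integrand of Equation (6); only the integrand changes.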

#### *3.2. Euler-Lagrange Equation for Gaussian Distributions*

The Fisher information matrix measures how much information about the parameters of a distribution is carried by random samples drawn from it. Intuitively, it can be considered an indication of how "peaked" the distribution is around a parameter value. If the distribution is sharply peaked, very few data points are required to locate it; as such, each data point carries a great deal of information.

For a multivariate probability distribution, the Fisher information matrix is given by

$$g\_{i,j}(\theta) = \int f(\mathbf{x};\theta) \frac{\partial}{\partial \theta\_i} \log f(\mathbf{x};\theta) \frac{\partial}{\partial \theta\_j} \log f(\mathbf{x};\theta)\, d\mathbf{x},\tag{8}$$

where the index (*i*, *j*) represents the appropriate parameter pair of the multivariate parameter vector *θ*.

Alternatively, there are additional useful forms of the Fisher information, provided that certain regularity conditions are satisfied. First, the Fisher information matrix is the negative of the expectation of the Hessian of the log-likelihood of the density function. Specifically,

$$g\_{i,j}(\theta) = -\mathcal{E}\left[\frac{\partial^2}{\partial \theta\_i \partial \theta\_j} \log f(\mathbf{x}; \theta)\right] = -\mathcal{E}[H], \tag{9}$$

where *H* is the Hessian matrix of the log-likelihood.

Second, the Fisher information can be calculated from the variance of the score function

$$g(\theta) = \text{Var}(S\_f(\mathbf{x}; \theta)),\tag{10}$$

where

$$S\_f(\mathbf{x}; \theta) = \nabla \log f(\mathbf{x}; \theta). \tag{11}$$
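To make Equations (9)–(11) concrete, the Monte Carlo sketch below (an illustration of ours, not from the source derivations) estimates the Fisher information of a univariate Gaussian with respect to *μ* both as the variance of the score and as the negative expected Hessian; both estimates should approach the known value 1/*σ*².

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Score with respect to mu: d/dmu log f(x; mu, sigma) = (x - mu) / sigma^2
score = (x - mu) / sigma**2

# Equation (10): Fisher information as the variance of the score.
fisher_score = score.var()

# Equation (9): negative expected Hessian.  Here d^2/dmu^2 log f = -1/sigma^2
# is constant, so the sample mean is exact up to sign.
hessian = np.full_like(x, -1.0 / sigma**2)
fisher_hessian = -hessian.mean()

print(fisher_score, fisher_hessian, 1 / sigma**2)  # all approximately 0.25
```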

Most importantly for our purposes, the Fisher information matrix is the metric tensor that defines distances on the Riemannian Gaussian manifold. Given a distribution on the manifold, we can use this metric tensor to minimize an appropriate functional and find the closest second distribution residing on a constrained subset of the manifold, a class of problems covered by variable-endpoint conditions in the calculus of variations.

Consider the *n*-dimensional multivariate Gaussian with density given by

$$f(\mathbf{x}; \mu, \Sigma) = (2\pi)^{-\frac{n}{2}} \det(\Sigma)^{-\frac{1}{2}} \exp\left(-\frac{(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)}{2}\right), \tag{12}$$

where **x** is the random vector, *μ* = [*μ*<sub>1</sub>, *μ*<sub>2</sub>, ..., *μ*<sub>*n*</sub>] is the *n*-dimensional mean vector of the distribution, and Σ is the *n* × *n* covariance matrix.

Since the covariance matrix is symmetric, it contains *n*(*n* + 1)/2 unique parameters, i.e., the number of diagonal and upper (or lower) triangular elements. With the *n*-dimensional mean vector, the total number of scalar parameters in an *n*-dimensional multivariate Gaussian is *n*(*n* + 3)/2, which will be the dimension of the Fisher information matrix. For all further developments in the parameter space, these parameters are collected in a single vector *θ* such that

$$\theta = \{ \mu\_1, \mu\_2, \dots, \mu\_n, \sigma\_{1,1}^2, \sigma\_{1,2}^2, \dots, \sigma\_{n,n}^2 \} \tag{13}$$

To clarify, this parameter vector *θ* has the mean vector *μ* as its first *n* components, and the remaining components are made up of the unique elements of the covariance matrix, starting with the first row, followed by the second row without its first entry, since Σ<sub>1,2</sub> = Σ<sub>2,1</sub> and Σ<sub>1,2</sub> is already included in *θ*. We capture all the parameters of the multivariate Gaussian distribution in this non-traditional vector form because it aligns conceptually with the calculation of the Fisher information matrix defined in Equation (8).
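The ordering just described is straightforward to implement; the helper below (a hypothetical utility of our own, assuming NumPy) packs *μ* and the row-wise upper triangle of Σ into *θ* per Equation (13) and confirms the *n*(*n* + 3)/2 parameter count.

```python
import numpy as np

def pack_theta(mu: np.ndarray, Sigma: np.ndarray) -> np.ndarray:
    """Collect the n means and the n(n+1)/2 unique covariance entries
    (row by row, upper triangle) into a single parameter vector theta,
    following the ordering of Equation (13)."""
    iu = np.triu_indices(mu.size)  # indices of the upper triangle of Sigma
    return np.concatenate([mu, Sigma[iu]])

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 3.0]])
theta = pack_theta(mu, Sigma)
n = mu.size
assert theta.size == n * (n + 3) // 2  # 5 parameters for the bivariate case
print(theta)  # [0.  1.  2.  0.5 3. ]
```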

Therefore, using Equations (8) and (12), the Fisher information for the general multivariate Gaussian distribution is

$$g\_{ij}(\mu,\Sigma) = \frac{1}{2}tr\left[\left(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta\_i}\right)\left(\Sigma^{-1}\frac{\partial\Sigma}{\partial\theta\_j}\right)\right] + \frac{\partial\mu^T}{\partial\theta\_i}\Sigma^{-1}\frac{\partial\mu}{\partial\theta\_j} \tag{14}$$

for which a detailed proof can be found in Appendix A of this paper. In the case of the bivariate Gaussian distribution, this 5 × 5 matrix has only 15 unique elements because of its symmetry. Once again, the detailed derivation of each of the elements is provided in Appendix A. The resulting metric tensor elements are

$$\begin{aligned} g\_{11} &= \frac{\sigma\_2^2}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \\ g\_{22} &= \frac{\sigma\_1^2}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \\ g\_{33} &= \frac{1}{2} \left( \frac{\sigma\_2^2}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \right)^2 \\ g\_{44} &= \frac{1}{2} \left( \frac{\sigma\_1^2}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \right)^2 \\ g\_{55} &= \frac{\sigma\_1^2 \sigma\_2^2 + \sigma\_{12}^2}{\left(\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2\right)^2} \\ g\_{12} &= -\frac{\sigma\_{12}}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} = g\_{21} \\ g\_{34} &= \frac{1}{2} \left( \frac{\sigma\_{12}}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \right)^2 = g\_{43} \\ g\_{35} &= -\frac{\sigma\_{12}\sigma\_2^2}{\left(\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2\right)^2} = g\_{53} \\ g\_{45} &= -\frac{\sigma\_{12}\sigma\_1^2}{\left(\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2\right)^2} = g\_{54} \end{aligned} \tag{15}$$

All elements capturing information between a component of *μ* and an element of Σ vanish, a property that extends to every multivariate Gaussian distribution of higher dimension as well.
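The entries of Equation (15), including the vanishing *μ*-Σ cross terms, can be verified mechanically from Equation (14). The SymPy sketch below (our own check, with *θ* ordered as (*μ*<sub>1</sub>, *μ*<sub>2</sub>, *σ*<sub>1</sub>², *σ*<sub>2</sub>², *σ*<sub>12</sub>) and the variances held as single symbols) builds the full 5 × 5 metric tensor and spot-checks several entries.

```python
import sympy as sp

# s1 and s2 stand for the variances sigma_1^2 and sigma_2^2.
s1, s2, s12 = sp.symbols('sigma1_sq sigma2_sq sigma12', positive=True)
mu1, mu2 = sp.symbols('mu1 mu2')
theta = [mu1, mu2, s1, s2, s12]

mu = sp.Matrix([mu1, mu2])
Sigma = sp.Matrix([[s1, s12], [s12, s2]])
Sigma_inv = Sigma.inv()

def g_entry(ti, tj):
    """Fisher metric entry from Equation (14)."""
    dSi, dSj = Sigma.diff(ti), Sigma.diff(tj)
    dmi, dmj = mu.diff(ti), mu.diff(tj)
    trace_term = sp.Rational(1, 2) * sp.trace(Sigma_inv * dSi * Sigma_inv * dSj)
    mean_term = (dmi.T * Sigma_inv * dmj)[0, 0]
    return sp.simplify(trace_term + mean_term)

g = sp.Matrix(5, 5, lambda i, j: g_entry(theta[i], theta[j]))
k = s1 * s2 - s12**2
assert sp.simplify(g[0, 0] - s2 / k) == 0                         # g11 of Equation (15)
assert sp.simplify(g[4, 4] - (s1 * s2 + s12**2) / k**2) == 0      # g55
assert all(g[i, j] == 0 for i in range(2) for j in range(2, 5))   # mu-Sigma block vanishes
```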

#### *3.3. Variable-Endpoint Formulation: Transversality Boundary Conditions*

Our development so far has focused on summarizing the usual situation of finding the length-minimizing path between two fixed points on a Riemannian manifold. We now shift to the situation of allowing one or both of the initial or final endpoints (distributions in this context) to be variable. This changes both the scope and the mathematics of the problem and requires the use of transversality boundary conditions.

Transversality conditions have been useful in other applications. In economics, the conditions are needed to solve infinite-horizon problems [40,41]. Other applications are found in biology [42] and physics [43]. However, to the best of our knowledge, there is little to no prior work investigating variable-endpoint formulations in the domain of information geometry. As we will demonstrate, the crux of employing these methods is appropriately defining the parameter constraint surface. Though our results provide concrete examples of interesting constraint surfaces, guidance on prescribing these subsets, or techniques for automatically learning them, will be important areas for future research.

The transversality conditions take into account a constraint hypersurface *φ*(*θ*) defined in the coordinates parameterizing the distributions of the statistical manifold. If, for example, we are given an initial distribution and are asked to find which distribution on *φ*(*θ*) is closest, the usual geodesic problem formulated in Equation (5) becomes

$$\begin{aligned} \min \quad & \mathcal{F}[\theta] = \frac{1}{2} \int\_{x\_0}^{x\_1} \dot{\theta}^T g(\theta) \dot{\theta}\, dx \\ & \theta(x\_0) = [\theta\_{01}, \theta\_{02}, \dots, \theta\_{0n}] \qquad & \theta(x\_1) \in \phi(\theta) \end{aligned} \tag{16}$$

Therefore, in addition to satisfying the Euler-Lagrange equation, now with transversality conditions, the optimal solution must also satisfy

$$\begin{bmatrix} K\_{\dot{\theta}\_1} \\ K\_{\dot{\theta}\_2} \\ \vdots \\ K\_{\dot{\theta}\_n} \end{bmatrix} = \alpha \begin{bmatrix} \phi\_{\theta\_1} \\ \phi\_{\theta\_2} \\ \vdots \\ \phi\_{\theta\_n} \end{bmatrix} \tag{17}$$

In Equation (17), the left-hand side is a vector tangent to the optimal path, and the vector on the right-hand side is the gradient of the terminal surface, which is orthogonal to that surface. Considering that the optimal path and the constraint surface intersect at the terminal distribution, this view of the transversality requirement implies that the tangent vector to the optimal path and the gradient of the constraint surface be collinear at the intersecting distribution. The scalar multiple *α* affects only the magnitude of the vector and, from a geometric perspective, there is no loss of generality in setting *α* = 1.
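Numerically, Equation (17) amounts to a collinearity test at the candidate terminal point. The sketch below (our own toy example with a Euclidean metric and a unit-circle constraint; the function and names are illustrative, not from the source) measures the component of *K*<sub>*θ*˙</sub> = 2*g*(*θ*)*θ*˙ orthogonal to the gradient of *φ*, which must vanish when the transversality condition holds.

```python
import numpy as np

def transversality_residual(g, theta_dot, grad_phi):
    """Residual of Equation (17): K_thetadot = 2 g(theta) theta_dot must be
    collinear with grad phi.  Returns the norm of the component of K_thetadot
    orthogonal to grad phi, which is zero when (17) holds."""
    K_thetadot = 2.0 * g @ theta_dot
    unit_normal = grad_phi / np.linalg.norm(grad_phi)
    return np.linalg.norm(K_thetadot - (K_thetadot @ unit_normal) * unit_normal)

# Toy check with a Euclidean metric and the unit circle |theta|^2 - 1 = 0:
g = np.eye(2)
grad_phi = np.array([2.0, 0.0])  # gradient of the constraint at theta = (1, 0)
print(transversality_residual(g, np.array([1.0, 0.0]), grad_phi))  # 0: transversal
print(transversality_residual(g, np.array([1.0, 1.0]), grad_phi))  # > 0: not transversal
```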

#### **4. Results: Transversal Euler-Lagrange Equations for Bivariate Gaussian Distributions**

We now turn our attention to various use cases of the transversal boundary conditions on the Gaussian manifold. We limit our derivations to bivariate Gaussians to keep the calculations tractable and for visualization purposes. However, the same development is applicable to higher-dimensional Gaussians. It is worth mentioning that even in the fixed-endpoint scenario, there are no closed-form solutions for the geodesic when the manifold coordinates include both *μ* and Σ, with analytical solutions existing only for special cases such as zero-mean distributions.

Using the Fisher information matrix defined in Equation (14), specialized to the bivariate Gaussian distribution as discussed in Appendix A, and employing the general form of the geodesic functional in Equation (4), we can define the integrand of the arc-length-minimizing functional on the space of bivariate Gaussian distributions as

$$\begin{split} K(\theta) &= \frac{\sigma\_2^2 \dot{\mu}\_1^2}{k} + \frac{\sigma\_1^2 \dot{\mu}\_2^2}{k} + \frac{(\sigma\_2^2)^2 (\dot{\sigma}\_1^2)^2}{2k^2} + \frac{(\sigma\_1^2)^2 (\dot{\sigma}\_2^2)^2}{2k^2} + \frac{\sigma\_1^2 \sigma\_2^2 \dot{\sigma}\_{12}^2}{k^2} + \frac{\sigma\_{12}^2 \dot{\sigma}\_{12}^2}{k^2} \\ &\quad - \frac{2\sigma\_{12} \dot{\mu}\_1 \dot{\mu}\_2}{k} + \frac{\sigma\_{12}^2 \dot{\sigma}\_1^2 \dot{\sigma}\_2^2}{k^2} - \frac{2\sigma\_{12} \sigma\_2^2 \dot{\sigma}\_1^2 \dot{\sigma}\_{12}}{k^2} - \frac{2\sigma\_{12} \sigma\_1^2 \dot{\sigma}\_2^2 \dot{\sigma}\_{12}}{k^2} \end{split} \tag{18}$$

where *k* = *σ*<sub>1</sub><sup>2</sup>*σ*<sub>2</sub><sup>2</sup> − *σ*<sub>12</sub><sup>2</sup>.

We can use Equation (18) to derive the system of second-order differential equations, solutions to which yield the shortest path between two distributions.

$$\ddot{\mu}\_1 = \frac{\dot{\mu}\_1 \sigma\_2^2 \dot{\sigma}\_1^2 + \dot{\mu}\_2 \sigma\_1^2 \dot{\sigma}\_{12} - \dot{\mu}\_2 \dot{\sigma}\_1^2 \sigma\_{12} - \dot{\mu}\_1 \sigma\_{12} \dot{\sigma}\_{12}}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \tag{19}$$

$$\ddot{\mu}\_2 = \frac{\dot{\mu}\_2 \sigma\_1^2 \dot{\sigma}\_2^2 + \dot{\mu}\_1 \sigma\_2^2 \dot{\sigma}\_{12} - \dot{\mu}\_1 \dot{\sigma}\_2^2 \sigma\_{12} - \dot{\mu}\_2 \sigma\_{12} \dot{\sigma}\_{12}}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \tag{20}$$

$$\ddot{\sigma}\_1^2 = \frac{\dot{\mu}\_1^2 \sigma\_{12}^2 + (\dot{\sigma}\_1^2)^2 \sigma\_2^2 + \sigma\_1^2 \dot{\sigma}\_{12}^2 - \dot{\mu}\_1^2 \sigma\_1^2 \sigma\_2^2 - 2\dot{\sigma}\_1^2 \sigma\_{12} \dot{\sigma}\_{12}}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \tag{21}$$

$$\ddot{\sigma}\_2^2 = \frac{\dot{\mu}\_2^2 \sigma\_{12}^2 + (\dot{\sigma}\_2^2)^2 \sigma\_1^2 + \sigma\_2^2 \dot{\sigma}\_{12}^2 - \dot{\mu}\_2^2 \sigma\_1^2 \sigma\_2^2 - 2\dot{\sigma}\_2^2 \sigma\_{12} \dot{\sigma}\_{12}}{\sigma\_1^2 \sigma\_2^2 - \sigma\_{12}^2} \tag{22}$$

$$\ddot{\sigma}\_{12} = -\frac{\sigma\_{12}\dot{\sigma}\_{12}^2 - \dot{\mu}\_1\dot{\mu}\_2\sigma\_{12}^2 - \sigma\_1^2\dot{\sigma}\_2^2\dot{\sigma}\_{12} - \dot{\sigma}\_1^2\sigma\_2^2\dot{\sigma}\_{12} + \dot{\sigma}\_1^2\dot{\sigma}\_2^2\sigma\_{12} + \dot{\mu}\_1\dot{\mu}\_2\sigma\_1^2\sigma\_2^2}{\sigma\_1^2\sigma\_2^2 - \sigma\_{12}^2} \tag{23}$$

Along with satisfying this system of equations, the solutions presented here must satisfy transversality conditions at one or both of the initial and terminal boundaries. In what follows, those conditions are prescribed according to the various applications of interest.
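As an illustration of the forward problem only (the boundary-value and transversality aspects require shooting or collocation on top of this), the sketch below integrates Equations (19)–(23) as a first-order system with SciPy's solve_ivp; the state ordering, initial condition, and tolerances are our own choices, and the right-hand side transcribes the equations as given above.

```python
import numpy as np
from scipy.integrate import solve_ivp

def geodesic_rhs(x, s):
    """First-order form of Equations (19)-(23).
    State s = [mu1, mu2, v1, v2, v12, dmu1, dmu2, dv1, dv2, dv12],
    where v1 = sigma_1^2, v2 = sigma_2^2, v12 = sigma_12."""
    mu1, mu2, v1, v2, v12, dmu1, dmu2, dv1, dv2, dv12 = s
    k = v1 * v2 - v12**2
    ddmu1 = (dmu1 * v2 * dv1 + dmu2 * v1 * dv12
             - dmu2 * dv1 * v12 - dmu1 * v12 * dv12) / k
    ddmu2 = (dmu2 * v1 * dv2 + dmu1 * v2 * dv12
             - dmu1 * dv2 * v12 - dmu2 * v12 * dv12) / k
    ddv1 = (dmu1**2 * v12**2 + dv1**2 * v2 + v1 * dv12**2
            - dmu1**2 * v1 * v2 - 2 * dv1 * v12 * dv12) / k
    ddv2 = (dmu2**2 * v12**2 + dv2**2 * v1 + v2 * dv12**2
            - dmu2**2 * v1 * v2 - 2 * dv2 * v12 * dv12) / k
    ddv12 = -(v12 * dv12**2 - dmu1 * dmu2 * v12**2 - v1 * dv2 * dv12
              - dv1 * v2 * dv12 + dv1 * dv2 * v12 + dmu1 * dmu2 * v1 * v2) / k
    return [dmu1, dmu2, dv1, dv2, dv12, ddmu1, ddmu2, ddv1, ddv2, ddv12]

# Start at a standard bivariate Gaussian, moving in the mu1 direction.
s0 = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
sol = solve_ivp(geodesic_rhs, (0.0, 1.0), s0, rtol=1e-8, dense_output=True)
print(sol.y[:5, -1])  # parameters of the distribution reached at x = 1
```

A full variable-endpoint solve would wrap this integrator in a root-finding loop that enforces Equation (17) at the terminal point.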
