*Proceeding Paper* **Orbit Classification and Sensitivity Analysis in Dynamical Systems Using Surrogate Models †**

**Katharina Rath 1,2,\*, Christopher G. Albert 2, Bernd Bischl 1 and Udo von Toussaint 2**


**Abstract:** Dynamics of many classical physics systems are described in terms of Hamilton's equations. Commonly, initial conditions are only imperfectly known. The associated volume in phase space is preserved over time due to the symplecticity of the Hamiltonian flow. Here we study the propagation of uncertain initial conditions through dynamical systems using symplectic surrogate models of Hamiltonian flow maps. This allows fast sensitivity analysis with respect to the distribution of initial conditions and an estimation of local Lyapunov exponents (LLE) that give insight into local predictability of a dynamical system. In Hamiltonian systems, LLEs permit a distinction between regular and chaotic orbits. Combined with Bayesian methods we provide a statistical analysis of local stability and sensitivity in phase space for Hamiltonian systems. The intended application is the early classification of regular and chaotic orbits of fusion alpha particles in stellarator reactors. The degree of stochastization during a given time period is used as an estimate for the probability that orbits of a specific region in phase space are lost at the plasma boundary. Thus, the approach offers a promising way to accelerate the computation of fusion alpha particle losses.

**Keywords:** Gaussian process regression; surrogate model; Lyapunov exponent; sensitivity analysis; Hamiltonian systems

#### **1. Introduction**

Hamilton's equations describe the dynamics of many classical physics systems such as classical mechanics, plasma physics or electrodynamics. In most of these cases, chaos plays an important role [1]. One fundamental question in analyzing these chaotic Hamiltonian systems is the distinction between regular and chaotic regions in phase space. A commonly used tool is the Poincaré map, which connects subsequent intersections of orbits with a lower-dimensional subspace, called the Poincaré section. For example, in a planetary system one could record a section each time the planet has made a turn around the Sun. The resulting pattern of intersection points on this subspace allows insight into the dynamics of the underlying system: regular orbits stay bound to a closed hyper-surface and do not leave the confinement volume, whereas chaotic orbits might spread over the whole phase space. This is related to the breaking of KAM (Kolmogorov-Arnold-Moser) surfaces that form barriers for motion in phase space [2]. The classification of regular versus chaotic orbits is performed, e.g., via box-counting [3] or by calculating the spectrum of Lyapunov exponents [4–6]. Lyapunov exponents measure the asymptotic average exponential rate of divergence of nearby orbits in phase space over infinite time and are therefore invariants of the dynamical system. When considering only finite time, the obtained *local* Lyapunov exponents (LLEs) for a specific starting position depend on the position in phase space and give insight into the local predictability of the dynamical system of interest [7–10].

**Citation:** Rath, K.; Albert, C.G.; Bischl, B.; von Toussaint, U. Orbit Classification and Sensitivity Analysis in Dynamical Systems Using Surrogate Models. *Phys. Sci. Forum* **2021**, *3*, 5. https://doi.org/10.3390/psf2021003005

Published: 5 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Poincaré maps are in most cases inefficient to compute, as their computation involves numerical integration of Hamilton's equations even though only intersections with the surface of interest are recorded. When using a surrogate model to interpolate the Poincaré map, the symplectic structure of phase space arising from the Hamiltonian description has to be preserved to obtain long-term stability and conservation of invariants of motion, e.g., volume preservation. Additional information on Hamiltonian systems and symplecticity can be found in [2,11]. Here, we use a structure-preserving Gaussian process surrogate model (SympGPR) that interpolates directly between Poincaré sections and thus avoids unnecessary computation while achieving similar accuracy as standard numerical integration schemes [12].

In the present work, we investigate how the symplectic surrogate model [12] can be used for early classification of chaotic versus regular trajectories based on the calculation of LLEs. The latter are calculated using the Jacobian that is directly available from the surrogate model [13]. As LLEs also depend on time, we study their distribution on various time scales to estimate the needed number of mapping iterations. We combine the orbit classification with a sensitivity analysis based on variance decomposition [14–16] to evaluate the influence of uncertain initial conditions in different regions of phase space. The analysis is carried out on the well-known standard map [17] that is well suited for validation purposes, as a closed form expression for the Poincaré maps is available. This, however, does not influence the performance of the surrogate model, which is applicable also in cases where such a closed form does not exist [12].

The intended application is the early classification of regular and chaotic orbits of fusion alpha particles in stellarator reactors [3]. While regular particles can be expected to remain confined indefinitely, only chaotic orbits have to be traced to the end. This offers a promising way to accelerate loss computations for stellarator optimization.

#### **2. Methods**

#### *2.1. Hamiltonian Systems*

An *f*-dimensional system (with a 2*f*-dimensional phase space) described by its Hamiltonian *H*(*q*, *p*, *t*), depending on *f* generalized coordinates *q* and *f* generalized momenta *p*, satisfies Hamilton's canonical equations of motion,

$$\dot{\boldsymbol{q}}(t) = \frac{d\boldsymbol{q}(t)}{dt} = \nabla_p H(\boldsymbol{q}(t), \boldsymbol{p}(t)), \qquad \dot{\boldsymbol{p}}(t) = \frac{d\boldsymbol{p}(t)}{dt} = -\nabla_q H(\boldsymbol{q}(t), \boldsymbol{p}(t)), \tag{1}$$

which represent the time evolution as integral curves of the Hamiltonian vector field.
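As a quick numerical illustration of Equation (1), consider the pendulum Hamiltonian *H*(*q*, *p*) = *p*²/2 − cos *q* integrated with the symplectic Euler scheme. This is a minimal sketch of our own choosing (the pendulum and the step size are not part of the paper's setup); it only demonstrates the bounded long-term energy behavior of a symplectic integrator:

```python
import numpy as np

def symplectic_euler_pendulum(q0, p0, h, n_steps):
    """Integrate H(q, p) = p**2/2 - cos(q) with symplectic Euler.

    Updating p first and then q preserves the symplectic structure
    of phase space implied by Hamilton's equations, Eq. (1).
    """
    q, p = q0, p0
    traj = np.empty((n_steps + 1, 2))
    traj[0] = q, p
    for n in range(n_steps):
        p = p - h * np.sin(q)   # dp/dt = -dH/dq = -sin(q)
        q = q + h * p           # dq/dt =  dH/dp = p
        traj[n + 1] = q, p
    return traj

traj = symplectic_euler_pendulum(1.0, 0.0, 0.01, 10_000)
energy = traj[:, 1]**2 / 2 - np.cos(traj[:, 0])
# For a symplectic integrator the energy error stays bounded over long times
# instead of drifting, as it would for explicit Euler.
print(np.max(np.abs(energy - energy[0])))
```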

Here, we consider the standard map [17] that is a well-studied model to investigate chaos in Hamiltonian systems. Each mapping step corresponds to one Poincaré map of a periodically kicked rotator:

$$p_{n+1} = (p_n + K \sin(q_n)) \bmod 2\pi, \qquad q_{n+1} = (q_n + p_{n+1}) \bmod 2\pi, \tag{2}$$

where *K* is the stochasticity parameter corresponding to the intensity of the perturbation. The standard map is an area-preserving map with det*J* = 1, where *J* is its Jacobian:

$$J = \begin{pmatrix} \frac{\partial q_{n+1}}{\partial q_n} & \frac{\partial q_{n+1}}{\partial p_n} \\ \frac{\partial p_{n+1}}{\partial q_n} & \frac{\partial p_{n+1}}{\partial p_n} \end{pmatrix} = \begin{pmatrix} 1 + K \cos(q_n) & 1 \\ K \cos(q_n) & 1 \end{pmatrix} \tag{3}$$
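Equations (2) and (3) translate directly into code. The following self-contained sketch (the value of *K* and the starting point are arbitrary illustrative choices) also checks the area-preservation property det *J* = 1 along an orbit:

```python
import numpy as np

def standard_map(q, p, K):
    """One iteration of the standard map, Equation (2)."""
    p_new = (p + K * np.sin(q)) % (2 * np.pi)
    q_new = (q + p_new) % (2 * np.pi)
    return q_new, p_new

def jacobian(q, K):
    """Jacobian of one mapping step, Equation (3); it depends only on q_n."""
    return np.array([[1.0 + K * np.cos(q), 1.0],
                     [K * np.cos(q),       1.0]])

q, p = 0.39, 2.85   # illustrative initial condition
K = 2.0
for _ in range(5):
    J = jacobian(q, K)
    # Area preservation: det J = (1 + K cos q) - K cos q = 1 everywhere.
    assert np.isclose(np.linalg.det(J), 1.0)
    q, p = standard_map(q, p, K)
print(q, p)
```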

#### *2.2. Symplectic Gaussian Process Emulation*

A Gaussian process (GP) [18] is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is fully specified by its mean *m*(*x*) and kernel or covariance function *K*(*x*, *x*′) and is denoted as

$$f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), K(\mathbf{x}, \mathbf{x}')), \tag{4}$$

for input data points *x* ∈ R*<sup>d</sup>*. Here, we allow vector-valued functions *f*(*x*) ∈ R*<sup>D</sup>* [19]. The covariance function is a positive semidefinite matrix-valued function, whose entries (*K*(*x*, *x*′))*ij* express the covariance between the output dimensions *i* and *j* of *f*(*x*).

For regression, we rely on observed function values *Y* ∈ R*<sup>D×N</sup>* with entries *y* = *f*(*x*) + *ε*. These observations may contain local Gaussian noise *ε*, i.e., the noise is independent at different positions *x* but may be correlated between components of *y*. The input variables are aggregated in the *d* × *N* design matrix *X*, where *N* is the number of training data points. The posterior distribution, after taking training data points into account, is still a GP with updated mean *F*∗ ≡ E(*F*(*X*∗)) and covariance function, allowing to make predictions for test data *X*∗:

$$F\_\* = K(X\_\*, X)(K(X, X) + \Sigma\_n)^{-1}Y,\tag{5}$$

$$\text{cov}(F\_\*) = K(X\_\*, X\_\*) - K(X\_\*, X)(K(X\_\*X) + \Sigma\_n)^{-1}K(X, X\_\*), \tag{6}$$

where Σ*<sub>n</sub>* ∈ R*<sup>ND×ND</sup>* is the covariance matrix of the multivariate output noise for each training data point. Here we use the shorthand notation *K*(*X*, *X*) for the block matrix assembled over the output dimension *D* in addition to the number of input points, as in a single-output GP with a scalar covariance function *k*(*x*, *x*′) that expresses the covariance of different input data points *x* and *x*′. The kernel parameters are estimated given the input data by minimizing the negative log-likelihood [18].
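For concreteness, Equations (5) and (6) might be implemented as follows for the simplest case of a single-output GP with a scalar squared exponential kernel (a simplified sketch of our own; the paper's multi-output, block-structured covariance extends this):

```python
import numpy as np

def sq_exp_kernel(X1, X2, sigma_f=1.0, l=1.0):
    """Squared exponential covariance k(x, x') for 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return sigma_f**2 * np.exp(-0.5 * d**2 / l**2)

def gp_posterior(X_train, y_train, X_test, sigma_n=1e-8):
    """Posterior mean and covariance, Equations (5) and (6)."""
    K = sq_exp_kernel(X_train, X_train) + sigma_n * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_test, X_train)
    K_ss = sq_exp_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)          # (K + Sigma_n)^{-1} Y
    mean = K_s @ alpha                           # Equation (5)
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T) # Equation (6)
    return mean, cov

X = np.linspace(0, 2 * np.pi, 8)
y = np.sin(X)
mean, cov = gp_posterior(X, y, X)
# With negligible observation noise the posterior mean interpolates
# the training data.
print(np.max(np.abs(mean - y)))
```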

To construct a GP emulator that interpolates symplectic maps for Hamiltonian systems, symplectic Gaussian process regression (SympGPR) was presented in [12], where the generating function *F*(*q*, *P*) and its gradients are interpolated using a multi-output GP with derivative observations [20,21]. The generating function links old coordinates (*q*, *p*) = (*qn*, *pn*) to new coordinates (*Q*, *P*) = (*qn*+1, *pn*+1) (e.g., after one iteration of the standard map, Equation (2)) via a canonical transformation such that the symplectic property of phase space is preserved. Thus, input data points consist of pairs (*q*, *P*). Then, the covariance matrix contains the Hessian of an original scalar covariance function *k*(*q*, *P*, *q*′, *P*′) as the lower block matrix *L*(*q*, *P*, *q*′, *P*′):

$$K(q, P, q', P') = \begin{pmatrix} k & \partial_{q'}k & \partial_{P'}k \\ \partial_{q}k & \partial_{q}\partial_{q'}k & \partial_{q}\partial_{P'}k \\ \partial_{P}k & \partial_{P}\partial_{q'}k & \partial_{P}\partial_{P'}k \end{pmatrix}, \qquad L = \begin{pmatrix} \partial_{q}\partial_{q'}k & \partial_{q}\partial_{P'}k \\ \partial_{P}\partial_{q'}k & \partial_{P}\partial_{P'}k \end{pmatrix}. \tag{7}$$

Using the algorithm for the (semi-)implicit symplectic GP map as presented in [12], once the SympGPR model is trained and the covariance matrix calculated, the model is used to predict subsequent time steps or Poincaré maps for arbitrary initial conditions.

For the estimation of the Jacobian (Equation (3)) from the SympGPR, the Hessian of the generating function *F*(*q*, *P*) has to be inferred from the training data. Thus, the covariance matrix is extended with a block matrix *C* containing third derivatives of *k*(*q*, *P*, *q*′, *P*′):

$$C = \begin{pmatrix} \partial_{q,q',q}k & \partial_{q,P',q}k & \partial_{P,q',q}k & \partial_{P,P',q}k \\ \partial_{q,q',P}k & \partial_{q,P',P}k & \partial_{P,q',P}k & \partial_{P,P',P}k \end{pmatrix}. \tag{8}$$

The mean of the posterior distribution of the desired Hessian of the generating function *F*(*q*, *P*) is inferred via

$$\nabla^2 F = (\partial^2_{qq} F, \partial^2_{qP} F, \partial^2_{Pq} F, \partial^2_{PP} F)^\top = C L^{-1} Y. \tag{9}$$

As we have a dependence on mixed coordinates *Q*(*q̄*(*q*, *p*), *P*(*q*, *p*)) and *P*(*Q*(*q*, *p*), *p̄*(*q*, *p*)), where we used *q̄*(*q*, *p*) = *q* and *p̄*(*q*, *p*) = *p* to correctly carry out the inner derivatives, the needed elements for the Jacobian can be calculated employing the chain rule. The Jacobian is then given as the solution of the well-determined linear set of equations:

$$\frac{\partial Q}{\partial q} = \frac{\partial Q}{\partial \bar{q}} \frac{\partial \bar{q}}{\partial q} + \frac{\partial Q}{\partial P} \frac{\partial P}{\partial q}, \qquad \frac{\partial Q}{\partial p} = \frac{\partial Q}{\partial \bar{q}} \frac{\partial \bar{q}}{\partial p} + \frac{\partial Q}{\partial P} \frac{\partial P}{\partial p}, \tag{10}$$

$$\frac{\partial P}{\partial q} = \frac{\partial P}{\partial Q} \frac{\partial Q}{\partial q} + \frac{\partial P}{\partial \bar{p}} \frac{\partial \bar{p}}{\partial q}, \qquad \frac{\partial P}{\partial p} = \frac{\partial P}{\partial Q} \frac{\partial Q}{\partial p} + \frac{\partial P}{\partial \bar{p}} \frac{\partial \bar{p}}{\partial p}, \tag{11}$$

where we use the following correspondence to determine all factors of the set of equations:

$$\begin{pmatrix} \frac{\partial Q}{\partial \bar{q}} & \frac{\partial Q}{\partial P} \\ \frac{\partial P}{\partial Q} & \frac{\partial P}{\partial \bar{p}} \end{pmatrix} = \begin{pmatrix} \frac{\partial \bar{q}}{\partial Q} & \frac{\partial P}{\partial Q} \\ \frac{\partial \bar{q}}{\partial p} & \frac{\partial P}{\partial p} \end{pmatrix}^{\top} = \begin{pmatrix} 1 + \frac{\partial^2 F}{\partial q \, \partial P} & -\frac{\partial^2 F}{\partial P \, \partial P} \\ -\frac{\partial^2 F}{\partial q \, \partial q} & 1 + \frac{\partial^2 F}{\partial q \, \partial P} \end{pmatrix}. \tag{12}$$

#### *2.3. Sensitivity Analysis*

Variance-based sensitivity analysis decomposes the variance of the model output into portions associated with uncertainty in the model inputs or initial conditions [14,15]. Assuming independent input variables *Xi*, *i* = 1, ..., *d*, the functional analysis of variance (ANOVA) allows a decomposition of the scalar model output *Y* from which the decomposition of the variance can be deduced:

$$V[Y] = \sum\_{i=1}^{d} V\_i + \sum\_{1 \le i < j \le d} V\_{ij} + \dots + V\_{1,2,\dots,d} \tag{13}$$

The first term describes the variation in variance due to changes in single variables *Xi* alone, whereas higher-order interactions are captured by the contributions of the interaction terms. From this, first-order Sobol' indices *Si* are defined as the corresponding fraction of the total variance, whereas *total* Sobol' indices *STi* also take the influence of *Xi* interacting with other input variables into account [14,15]:

$$S_i = \frac{V_i}{\text{Var}(Y)}, \qquad S_{T_i} = \frac{E_{X_{\sim i}}\big(\text{Var}_{X_i}(Y|X_{\sim i})\big)}{\text{Var}(Y)} \tag{14}$$

Several methods for efficiently calculating Sobol' indices have been presented, e.g., MC sampling [14,16] or direct estimation from surrogate models [22,23]. Here, we use the MC sampling strategy presented in [16] using two sampling matrices *A*, *B* and a combination of both, $A_B^{(i)}$, where all columns are from *A* except the *i*-th column, which is from *B*:

$$S_i \,\text{Var}(Y) = \frac{1}{N} \sum_{j=1}^{N} f(B)_j \big(f(A_B^{(i)})_j - f(A)_j\big), \qquad S_{T_i} \,\text{Var}(Y) = \frac{1}{2N} \sum_{j=1}^{N} \big(f(A)_j - f(A_B^{(i)})_j\big)^2, \tag{15}$$

where *f* denotes the model to be evaluated.
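The estimators in Equation (15) can be sketched on a toy additive model of our own choosing (not the standard map), picked because its indices are known analytically: for *Y* = *X*₁ + 2*X*₂ with uniform inputs, *S*₁ = *S*<sub>*T*1</sub> = 0.2 and *S*₂ = *S*<sub>*T*2</sub> = 0.8.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(X):
    """Toy additive model; analytic Sobol' indices are S1 = 0.2, S2 = 0.8."""
    return X[:, 0] + 2.0 * X[:, 1]

N, d = 50_000, 2
A = rng.uniform(size=(N, d))
B = rng.uniform(size=(N, d))

fA, fB = f(A), f(B)
var_Y = np.var(np.concatenate([fA, fB]))

S, ST = np.empty(d), np.empty(d)
for i in range(d):
    AB = A.copy()
    AB[:, i] = B[:, i]  # all columns from A except the i-th, taken from B
    fAB = f(AB)
    S[i] = np.mean(fB * (fAB - fA)) / var_Y          # first-order index
    ST[i] = np.mean((fA - fAB)**2) / (2.0 * var_Y)   # total index

print(S, ST)
```

Because the model is purely additive, first-order and total indices coincide; for the standard map outputs (*Q*, *P*) the same estimator is applied per mapping iteration.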

#### *2.4. Local Lyapunov Exponents*

For a dynamical system in R*<sup>D</sup>*, the *D* Lyapunov characteristic exponents *λn* give the exponential separation of trajectories with initial conditions *z*(0) = (*q*(0), *p*(0)) under a perturbation *δz* over time:

$$|\delta \mathbf{z}(T)| = \big|\mathcal{J}^{(T)}_{\mathbf{z}(T)} \, \delta \mathbf{z}(0)\big| \approx e^{T\lambda} |\delta \mathbf{z}(0)|, \tag{16}$$

where $\mathcal{J}^{(T)}_{z(T)}$ is the time-ordered product of Jacobians $\mathcal{J}_{z(T-1)} \mathcal{J}_{z(T-2)} \cdots \mathcal{J}_{z(1)} \mathcal{J}_{z(0)}$ [4]. The Lyapunov exponents are then given as the logarithm of the eigenvalues of the positive and symmetric matrix

$$\Lambda = \lim_{T \to \infty} \big[\mathcal{J}^{(T)\top}_{\mathbf{z}(T)} \mathcal{J}^{(T)}_{\mathbf{z}(T)}\big]^{1/(2T)}, \tag{17}$$

where ⊤ denotes the transpose of $\mathcal{J}^{(T)}_{z(T)}$.

For a *D*-dimensional system, there exist *D* Lyapunov exponents *λn* giving the rate of growth of a *D*-volume element, with *λ*1 + ... + *λD* corresponding to the rate of growth of the determinant of the Jacobian det($\mathcal{J}^{(T)}_{z(T)}$). From this it follows that for a Hamiltonian system with its symplectic (and hence volume-preserving) phase space structure, Lyapunov exponents exist in additive inverse pairs, as the determinant of the Jacobian is constant: *λ*1 + ... + *λD* = 0. In the dynamical system of the standard map with *D* = 2 considered here, the Lyapunov exponents allow a distinction between regular and chaotic motion. If the Lyapunov exponents *λ*1 = −*λ*2 > 0, neighboring orbits separate exponentially, which corresponds to a chaotic region. In contrast, when *λ*1 = −*λ*2 ≈ 0, the motion is regular [1].

As the product of Jacobians is ill-conditioned for large values of *T*, several algorithms have been proposed to calculate the spectrum of Lyapunov exponents [13]. Here, we determine *local* Lyapunov exponents (LLEs) that determine the predictability of an orbit of the system at a specific phase point for finite time. In contrast to *global* Lyapunov exponents, they depend on *T* and on the position in phase space *z*. We use a recurrent Gram-Schmidt orthonormalization procedure through QR decomposition [5,6,24], where we follow the evolution of *D* initially orthonormal deviation vectors $w^n_0$. The Jacobian is decomposed into $\mathcal{J}_{z(0)} = Q^{(1)} R^{(1)}$, where $Q^{(1)}$ is an orthogonal matrix and $R^{(1)}$ is an upper triangular matrix, yielding a new set of orthonormal vectors. At the next mapping iteration, the matrix product $\mathcal{J}_{z(1)} Q^{(1)}$ is again decomposed. This procedure is repeated *T* times to arrive at $\mathcal{J}^{(T)}_{z(T)} = Q^{(T)} R^{(T)} R^{(T-1)} \cdots R^{(0)}$. The Lyapunov exponents are then estimated from the diagonal elements of $R^{(t)}$:

$$
\lambda\_n = \frac{1}{T} \sum\_{t=1}^T \ln R\_{nn}^{(t)}.\tag{18}
$$
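The QR recursion leading to Equation (18) might look as follows for the standard map. As a sketch, the analytic Jacobian of Equation (3) is used here in place of the surrogate-inferred one; the initial condition is the chaotic orbit for *K* = 2.0 discussed below.

```python
import numpy as np

def local_lyapunov(q0, p0, K, T):
    """Local Lyapunov exponents of the standard map via repeated QR
    decomposition of the Jacobian product (Equation (18))."""
    q, p = q0, p0
    Q = np.eye(2)
    log_r = np.zeros(2)
    for _ in range(T):
        J = np.array([[1.0 + K * np.cos(q), 1.0],   # Equation (3)
                      [K * np.cos(q),       1.0]])
        Q, R = np.linalg.qr(J @ Q)                  # re-orthonormalize
        log_r += np.log(np.abs(np.diag(R)))
        p = (p + K * np.sin(q)) % (2 * np.pi)       # Equation (2)
        q = (q + p) % (2 * np.pi)
    return log_r / T

lle = local_lyapunov(0.39, 2.85, 2.0, 1000)
# For an area-preserving map the exponents come in a pair
# lambda_1 = -lambda_2, so their sum vanishes up to round-off.
print(lle)
```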

#### **3. Results and Discussion**

In the following we apply an implicit SympGPR model with a product kernel [12]. Due to the periodic topology of the standard map we use a periodic kernel function to construct the covariance matrix in Equation (7) with periodicity 2*π* in *q*, whereas a squared exponential kernel is used in *P*:

$$k(q, q\_i, P, P\_i) = \sigma\_f^2 \exp\left(-\frac{\sin^2((q - q\_i)/2)}{2l\_q^2}\right) \exp\left(-\frac{(P - P\_i)^2}{2l\_P^2}\right). \tag{19}$$

Here $\sigma_f^2$ specifies the amplitude of the fit and is set in accordance with the observations to $2\max(|Y|)^2$, where *Y* corresponds to the change in coordinates. The hyperparameters $l_q$, $l_P$ are set to their maximum likelihood values by minimizing the negative log-likelihood given the input data using the L-BFGS-B routine implemented in Python [18]. The noise in the observations is set to $\sigma_n^2 = 10^{-16}$. 30 initial data points are sampled from a Halton sequence to ensure good coverage of the training region in the range [0, 2*π*] × [0, 2*π*], and Equation (2) is evaluated once to obtain the corresponding final data points. Each pair of initial and final conditions constitutes one sample of the training data set. Once the model is trained, it is used to predict subsequent mapping steps for arbitrary initial conditions and to infer the corresponding Jacobians for the calculation of the local Lyapunov exponents. Here, we consider two test cases of the standard map with different values of the stochasticity parameter, *K* = 0.9 and *K* = 2.0 (Equation (2)). For each of the test cases, a surrogate model is trained. While in the first case the last KAM surface is not yet broken and therefore the region of stochasticity is still confined in phase space, in the latter case the chaotic region covers a much larger portion of phase space. However, there still exist islands of stability with regular orbits [2]. For *K* = 0.9 the mean squared error (MSE) for the training data is $1.4 \times 10^{-6}$, whereas the test MSE after one mapping application is found to be $2.4 \times 10^{-6}$. A similar quality of the surrogate model is reached for *K* = 2.0, where the training MSE is $1.6 \times 10^{-7}$ and the test MSE $2.4 \times 10^{-7}$.
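A direct transcription of the kernel in Equation (19) is short; the sketch below omits the hyperparameter fitting and uses illustrative values for $\sigma_f$, $l_q$, $l_P$:

```python
import numpy as np

def kernel(q, qi, P, Pi, sigma_f=1.0, lq=1.0, lP=1.0):
    """Product kernel of Equation (19): periodic (period 2*pi) in q,
    squared exponential in P."""
    k_q = np.exp(-np.sin((q - qi) / 2.0)**2 / (2.0 * lq**2))
    k_P = np.exp(-((P - Pi)**2) / (2.0 * lP**2))
    return sigma_f**2 * k_q * k_P

# At coincident points the kernel attains its maximum sigma_f**2.
print(kernel(0.3, 0.3, 1.0, 1.0))  # → 1.0
# Shifting q by a full period 2*pi leaves the covariance unchanged.
print(np.isclose(kernel(0.3 + 2 * np.pi, 0.5, 1.0, 1.2),
                 kernel(0.3, 0.5, 1.0, 1.2)))  # → True
```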

#### *3.1. Local Lyapunov Exponents and Orbit Classification*

For the evaluation of the distribution of the local Lyapunov exponents with respect to the number of mapping iterations *T* and phase space position *z* = (*q*, *p*), 1000 points are sampled from each orbit under investigation. In the following, we only consider the maximum local Lyapunov exponent, as it determines the predictability of the system. For each of the 1000 points, the LLEs are calculated using Equation (18), where the needed Jacobians are obtained from the surrogate model by evaluating Equation (9) and solving Equations (10) and (11).

Figure 1 shows the distributions for *K* = 2.0, *T* = 50, *T* = 100 and *T* = 1000 for two different initial conditions resulting in a regular and a chaotic orbit. In the regular case the distribution exhibits a sharp peak and with increasing *T* moves closer to 0. This bias due to the finite number of mapping iterations decreases with O(1/*T*) as shown in Figure 2 [25]. For the chaotic orbit, the distribution looks smooth and its median is clearly >0 as expected. For a smaller value of *K* = 0.9 the dynamics in phase space exhibit larger variety with regular, chaotic and also weakly chaotic orbits that remain confined in a small stochastic layer around hyperbolic points. Hence, the transition between regular, weakly chaotic and chaotic orbits is continuous due to the larger variety in phase space. For fewer mapping iterations, possible values of *λ* are overlapping, thus preventing a clear distinction between confined chaotic and chaotic orbits.

**Figure 1.** Distribution of local Lyapunov exponents for a (**a**) regular orbit (*q*, *p*) = (1.96, 4.91) and (**b**) chaotic orbit (*q*, *p*) = (0.39, 2.85) in the standard map with *K* = 2.0.

**Figure 2.** Rate of convergence of the bias due to the finite number of mapping iterations for (**a**) *K* = 2.0 with a regular orbit (*q*, *p*) = (1.96, 4.91) (diamond) and a chaotic orbit (*q*, *p*) = (0.39, 2.85) (x) and (**b**) *K* = 0.9 with a regular orbit (*q*, *p*) = (1.76, 0.33) (diamond), a confined chaotic orbit (*q*, *p*) = (0.02, 2.54) (circle) and a chaotic orbit (*q*, *p*) = (0.2, 5.6) (x). The graphs show $\tilde{\lambda}_T$, the median of $\lambda_T$ for each *T*, with $\tilde{\lambda}_T = \lambda + c/T$ fitted by linear regression of $T\tilde{\lambda}_T$ on *T*. The gray areas correspond to the standard deviation for 1000 test points.

When considering the whole phase space with 200 orbits with initial conditions sampled from a Halton sequence in the range [0, *π*] × [0, 2*π*], already *T* = 50 mapping iterations provide insight into the predictability of the standard map (Figure 3). If for a region in phase space the obtained LLE is positive, the predictability in this region is restricted, as the instability there is relatively large. If, however, the LLE is close to zero, we can conclude that this region in phase space is governed by regular motion and is therefore highly predictable. For *K* = 2.0 the orbits constituting the chaotic sea have large positive LLEs, whereas islands of stability formed by regular orbits show LLEs close to 0. A similar behavior can be observed for *K* = 0.9, where again regions around stable elliptic points feature *λ* ≈ 0, while stochastic regions exhibit a varying range of LLEs in accordance with Figure 2.

Based on the estimation of the LLEs, a Gaussian Bayesian classifier [26] is used to determine the probability of an orbit being regular, where we assume that LLEs are normally distributed in each class. First, the classifier is trained on LLEs resulting from 200 different initial conditions for *T* mapping iterations, with the corresponding class labels obtained from the chosen reference, the generalized alignment index (GALI) [27]. Then, 10<sup>4</sup> test orbits are sampled from a regular grid in the range [0, *π*] × [0, 2*π*] with Δ*q* = Δ*p* = 2*π*/10, their LLE is calculated for *T* mapping iterations, and the orbits are then classified. The results for *K* = 0.9 and *K* = 2.0 with *T* = 50 are shown in Figure 4, where the color map indicates the probability that the test orbit is regular. While for *K* = 2.0 the classifier provides a very clear distinction between regular and chaotic regions, the distinction between confined chaotic and regular orbits for *K* = 0.9 is less clear. With an increasing number of mapping iterations, the number of misclassifications reduces, as depicted in Figure 5. If the predicted probability that an orbit belongs to a certain class is lower than 70%, the prediction is not accepted and the orbit is marked as misclassified. With *K* = 0.9, the percentage of misclassified orbits does not drop below approximately 10%, because the transition between regular and chaotic motion is continuous.
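A univariate Gaussian Bayesian classifier of this kind can be sketched in a few lines. The training values below are synthetic stand-ins of our own (regular orbits with LLEs near 0, chaotic ones clearly positive), not GALI-labeled data from the paper:

```python
import numpy as np

def fit_gaussian_classes(lle, labels):
    """Per-class mean, variance and prior of the scalar LLE feature."""
    params = {}
    for c in np.unique(labels):
        x = lle[labels == c]
        params[c] = (x.mean(), x.var() + 1e-12, len(x) / len(lle))
    return params

def prob_regular(lle_test, params, regular=0):
    """Posterior probability that an orbit is regular (Bayes' rule with
    a univariate normal likelihood per class)."""
    post = {}
    for c, (mu, var, prior) in params.items():
        lik = np.exp(-(lle_test - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        post[c] = prior * lik
    return post[regular] / sum(post.values())

# Synthetic training set: label 0 = regular, label 1 = chaotic.
rng = np.random.default_rng(0)
lle_train = np.concatenate([rng.normal(0.02, 0.01, 100),
                            rng.normal(0.4, 0.1, 100)])
labels = np.concatenate([np.zeros(100, int), np.ones(100, int)])
params = fit_gaussian_classes(lle_train, labels)
print(prob_regular(0.01, params), prob_regular(0.45, params))
```

Rejecting predictions whose posterior stays below 70% reproduces the "not accepted" rule used for Figure 5.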

**Figure 3.** Local Lyapunov exponents in phase space of the standard map calculated with *T* = 50 mapping iterations for (**a**) *K* = 2.0, (**b**) *K* = 0.9 .

**Figure 4.** Orbit classification in standard map, (**a**) *K* = 2.0, (**b**) *K* = 0.9 for *T* = 50. The color map indicates the probability that the orbit is regular.

**Figure 5.** Percentage of misclassified orbits using a Bayesian classifier trained with 200 orbits for (**a**) *K* = 2.0 and (**b**) *K* = 0.9. 100 test orbits on an equally spaced grid in the range of [0, *π*] × [0, 2*π*] are classified as regular or chaotic depending on their LLE.

#### *3.2. Sensitivity Analysis*

The total Sobol' indices are calculated for the outputs of the symplectic surrogate model (*Q*, *P*) using Equation (15) with *N* = 2000 uniformly distributed random points within a box of size $10^{-3} \times 10^{-3}$ for each of the *T* = 100 mapping iterations, as we are interested in the temporal evolution of the indices. For the standard map at *K* = 0.9 with *d* = 2 input and *D* = 2 output dimensions, 4 total Sobol' indices are obtained: $S_q^Q$ and $S_q^P$ denoting the influence of *q*, and $S_p^Q$ and $S_p^P$ marking the influence of *p* on the output. We obtain good agreement, with an MSE on the order of $10^{-6}$ between the indices obtained by the surrogate model and those using reference data.

As shown in Figure 6 for three different initial conditions for *K* = 0.9, the sensitivity indices behave differently depending on the orbit type, either chaotic or regular. In case of a regular orbit close to a fixed point, the $S_j^i$ oscillate, indicating that both input variables have similar influence on average. Farther from the fixed point, closer to the border of stability, the influence of *q* increases. This is in contrast to the behavior in the chaotic case, where initially the variance in *p* has the larger influence on the model output. However, when observing the indices over longer periods of time, both variables have similar influence. In Movie S01 in the supplemental material, the time evolution of all four total Sobol' indices obtained for the standard map is shown in phase space. Each frame is averaged over 10 subsequent mapping iterations. One snapshot is shown in Figure 7. The observation of the whole phase space supports the findings in Figure 6.

**Figure 6.** Total Sobol' indices as a function of time for three orbits of the standard map with *K* = 0.9 upper: chaotic orbit (*q*, *p*)=(0.2, 5.6), middle: regular orbit (*q*, *p*)=(1.76, 0.33), lower: regular orbit very close to fixed point (*q*, *p*)=(*π*, 0.1).

**Figure 7.** Total Sobol' indices (Equation (15)) for the standard map with *K* = 0.9 averaged from *t* = 20 to *t* = 30.

#### **4. Conclusions**

We presented an approach for orbit classification in Hamiltonian systems based on a structure-preserving surrogate model, combined with early classification based on local Lyapunov exponents directly available from the surrogate model. The approach was tested on two cases of the standard map. Depending on the perturbation strength, we either see a continuous transition from regular to chaotic orbits for *K* = 0.9 or a sharp separation between those two classes for higher perturbation strengths. This also impacts the classification results obtained from a Bayesian classifier. The presented method is applicable to chaotic Hamiltonian systems and is especially useful when a closed form expression for Poincaré maps is not available. Also, the accompanying sensitivity analysis provides valuable insight: in transition regions between regular and chaotic motion, the Sobol' indices for time series can be used to analyze the influence of input variables.

**Author Contributions:** Conceptualization, K.R., C.G.A., B.B. and U.v.T.; methodology, K.R., C.G.A., B.B. and U.v.T.; software, K.R.; validation, K.R., C.G.A., B.B. and U.v.T.; formal analysis, K.R., C.G.A., B.B. and U.v.T.; writing—original draft preparation, K.R.; visualization, K.R.; supervision, C.G.A., U.v.T. and B.B.; funding acquisition, C.G.A, B.B. and U.v.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** The present contribution is supported by the Helmholtz Association of German Research Centers under the joint research school HIDSS-0006 "Munich School for Data Science-MUDS" and the Reduced Complexity grant No. ZT-I-0010.

**Data Availability Statement:** The data and source code that support the findings of this study are openly available [28] and maintained on https://github.com/redmod-team/SympGPR.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Proceeding Paper* **The ABC of Physics †**

**John Skilling 1,‡ and Kevin Knuth 2,\*,‡**


**Abstract:** Why quantum? Why spacetime? We find that the key idea underlying both is *uncertainty*. In a world lacking probes of unlimited delicacy, our knowledge of quantities is necessarily accompanied by uncertainty. Consequently, physics requires a calculus of number *pairs* and not only scalars for quantity alone. Basic symmetries of shuffling and sequencing dictate that pairs obey ordinary component-wise addition, but they can have three different multiplication rules. We call those rules A, B and C. "A" shows that pairs behave as complex numbers, which is *why quantum theory is complex*. However, consistency with the ordinary scalar rules of probability shows that the fundamental object is not a particle on its Hilbert sphere but a stream represented by a Gaussian distribution. "B" is then applied to pairs of complex numbers (qubits) and produces the Pauli matrices for which its operation defines the space of four vectors. "C" then allows integration of what can then be recognised as energy-momentum into time and space. The picture is entirely consistent. *Spacetime is a construct of quantum* and not a container for it.

**Keywords:** quantum; spacetime; uncertainty; symmetry

#### **1. Strategy**

Simplicity yields generality, and generality is power. There are deep mysteries in physics, such as why space has three dimensions and why quantum formalism is complex. Inquiry at such depth demands sparse assumptions of compelling generality. We aim to leave no room for plausible doubt; thus, we allow no delicate assumptions (such as continuity, perhaps) which could plausibly be denied. We paint with a broad brush, aiming to mirror simplicity of ideas with simplicity of presentation. Our aim is to expose the inevitable *language* of physics in a form accessible to neophyte students as well as experienced professionals. This is only a beginning, and we do not proceed to discuss the physical laws that form the content of the language.

In attempting to understand the world, we seem forced to think of it in terms of adequately isolated parts with adequately separable properties. To obtain generality, we hypothesize symmetries such that laws applying to one part will apply to another. Otherwise, without consistent rules, we will find no generalities, and our quest will fail. Symmetries are our only hope. If the world is to be comprehended at all, this is the path we should follow. What, if anything, the symmetries apply to will only be apparent later when we try to match our intellectual constructions to sensory impressions of the world.

#### **2. Mathematics**

**Citation:** Skilling, J.; Knuth, K.H. The ABC of Physics. *Phys. Sci. Forum* **2021**, *3*, 9. https://doi.org/10.3390/psf2021003009

Academic Editors: Wolfgang von der Linden and Sascha Ranftl

Published: 10 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

We first suppose a "with" operator that links two parts ("stuff") to make compound stuff. Technically, this operator is deemed to possess *closure*: stuff -with- stuff = stuff. Furthermore, since we have closure, we can continue adding other stuff to produce yet more stuff. For this to be fully useful, we suppose that we can shuffle stuff around without it making any difference. Formally, shuffling is associative commutativity.

$$\begin{array}{ll} A \text{ -with- } (B \text{ -with- } C) = (A \text{ -with- } B) \text{ -with- } C & \text{associative} \\ A \text{ -with- } B = B \text{ -with- } A & \text{commutative} \end{array} \tag{1}$$

If our stuff has only one property of interest, it is not hard to show that the "with" operator can without loss of generality be taken to be standard arithmetical addition [1–4]. Any other representation is isomorphic. In order to model an arbitrarily large world, we suppose that combination can be conducted indefinitely without repeat: *X* -with- *Y* ≠ *X* unless *Y* is null. Otherwise, we would obtain the wrap-around of integers limited to finite storage. This is why stuff adds up [5].

$$\begin{array}{ll} a + (b + c) = (a + b) + c & \text{associative} \\ a + b = b + a & \text{commutative} \end{array} \tag{2}$$

It is not difficult: Children understand it as they learn to count. The wide utility of addition mirrors the ubiquity of symmetry under shuffling. There are many applications, but we might think of shuffling as a parallel operation which could take place in some sort of proto-space.

Now, we will want "+" to continue to work regardless of who supplies the stuff or, for that matter, who receives it; thus, we next suppose a "then" operator that transfers stuff, possibly with modification, as in *A* -then- *B*. This operator is also supposed to possess closure: (*A* -then-) followed by (*B* -then-) is itself a transfer (*C* -then-), so that suppliers can be chained, and as before we suppose that chaining can be performed indefinitely. We might think of -then- as a series operation that could take place in some sort of proto-time.

For -with- to be additive regardless of suppliers or receivers, we need -then- to be distributive.

$$\begin{array}{ll} A \text{ -then- } (B \text{ -with- } C) = (A \text{ -then- } B) \text{ -with- } (A \text{ -then- } C) & \text{left distributive} \\ (A \text{ -with- } B) \text{ -then- } C = (A \text{ -then- } C) \text{ -with- } (B \text{ -then- } C) & \text{right distributive} \end{array} \tag{3}$$

With only one property of interest in play, the representation is merely scalar. The only freedom allowed within the sum-rule convention is scaling; thus, the representation "·" of -then- has to be standard arithmetical multiplication, possibly scaled by some constant *γ* [3,4]:

$$a \cdot b = \gamma ab \qquad \text{(product rule)}\tag{4}$$

which is most commonly set by convention to one. Again, this is not difficult: children learn it informally as they are taught multiplication, and they learn *γ* with percentages.

Note that we are building arithmetic and not Boolean calculus. Our "then" describes sequencing and is distributive over "with" but not the other way around. We point out that only with standard arithmetic in place can we build the sophisticated mathematics, which gives us quantified science. This is why mathematics is so effective—we have *designed* it to obey the fundamental symmetries which are required for our generalisations. Moreover, we need the symmetries, for denial would remove our mathematics and render us powerless. How could we justify Hilbert spaces if we cannot even add up?

#### **3. Physics**

Physics is more subtle, for arbitrary precision is not available with finite probes (Figure 1, left).

**Figure 1.** (**Left**) Uncertainty: probe and fragile target. (**Right**) Measurement: delicate probe with target.

This means that any particular quantity is necessarily associated with an uncertainty. Hence, a minimal description (and anything more would lack rationale and be mathematically redundant) requires a *pair* of numbers to quantify a property. We do not mean "*μ* ± *σ*," which is merely crude shorthand for a distribution Pr(*x*) over the available values *x*—in which those numbers too would be uncertain, resulting in indefinite regress. The fundamentally irreducible pair-wise connection is simpler and more intimate than that. Uncertainty becomes part of the very *language* of physics and is impossible to remove.

To define the connections analogous to scalar addition and multiplication, we start by imposing the same ubiquitous symmetries as before. Associative commutativity yields component-wise addition for -with- :

$$
\begin{pmatrix} a\_1 \\ a\_2 \end{pmatrix} + \begin{pmatrix} b\_1 \\ b\_2 \end{pmatrix} = \begin{pmatrix} a\_1 + b\_1 \\ a\_2 + b\_2 \end{pmatrix} \tag{5}
$$

and distributivity yields bilinear multiplication for -then- :

$$
\begin{pmatrix} a\_1 \\ a\_2 \end{pmatrix} \cdot \begin{pmatrix} b\_1 \\ b\_2 \end{pmatrix} = \begin{pmatrix} \gamma\_1 a\_1 b\_1 + \gamma\_2 a\_1 b\_2 + \gamma\_3 a\_2 b\_1 + \gamma\_4 a\_2 b\_2 \\ \gamma\_5 a\_1 b\_1 + \gamma\_6 a\_1 b\_2 + \gamma\_7 a\_2 b\_1 + \gamma\_8 a\_2 b\_2 \end{pmatrix} \tag{6}
$$

but now with eight apparently arbitrary coefficients *γ* and not only a single removable scale factor. It is true that we have four degrees of freedom to choose coordinates, but that still leaves four of the *γ*'s free, so that multiplication is not yet adequately defined. This extra subtlety of physics requires us to make the multiplication *associative*, meaning that the effect of chaining does not depend on how the factors are grouped into compounds.

$$a \cdot (b \cdot c) = (a \cdot b) \cdot c \qquad \text{associative} \tag{7}$$

In one dimension, associativity was automatically an emergent property of multiplication, but we need to impose it in two dimensions. Associative multiplication of pairs yields quadratic equations for the *γ*'s, whose multiple solutions A, B, C, D, E and F enable the rich framework of physics. Explicitly, the six product rules are the following [6–8].

$$
\begin{pmatrix} a\_1 \\ a\_2 \end{pmatrix} \cdot \begin{pmatrix} b\_1 \\ b\_2 \end{pmatrix} =
\underbrace{\begin{pmatrix} a\_1 b\_1 - a\_2 b\_2 \\ a\_1 b\_2 + a\_2 b\_1 \end{pmatrix}}\_{\text{A}} \;\text{or}\;
\underbrace{\begin{pmatrix} a\_1 b\_1 + a\_2 b\_2 \\ a\_1 b\_2 + a\_2 b\_1 \end{pmatrix}}\_{\text{B}} \;\text{or}\;
\underbrace{\begin{pmatrix} a\_1 b\_1 \\ a\_1 b\_2 + a\_2 b\_1 \end{pmatrix}}\_{\text{C}} \;\text{or}\;
\underbrace{\begin{pmatrix} a\_1 b\_1 \\ a\_1 b\_2 \end{pmatrix}}\_{\text{D}} \;\text{or}\;
\underbrace{\begin{pmatrix} a\_1 b\_1 \\ a\_2 b\_1 \end{pmatrix}}\_{\text{E}} \;\text{or}\;
\underbrace{\begin{pmatrix} a\_1 b\_1 \\ 0 \end{pmatrix}}\_{\text{F}} \tag{9}
$$

The first three, A,B and C, are pair-pair products, the next two, D and E, are degenerate scalar-pair products, and the last F is ordinary scalar–scalar multiplication appearing as the doubly degenerate case.
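As a quick sanity check (not part of the paper itself), the six product rules as reconstructed in (9) can be tested numerically for associativity; the helper names below are our own:

```python
import random

# Candidate pair products, labelled per our reconstruction of Eq. (9):
# A = complex, B = hyperbolic, C = dual/integrator, D/E = scalar-pair, F = scalar.
RULES = {
    "A": lambda a, b: (a[0]*b[0] - a[1]*b[1], a[0]*b[1] + a[1]*b[0]),
    "B": lambda a, b: (a[0]*b[0] + a[1]*b[1], a[0]*b[1] + a[1]*b[0]),
    "C": lambda a, b: (a[0]*b[0],             a[0]*b[1] + a[1]*b[0]),
    "D": lambda a, b: (a[0]*b[0], a[0]*b[1]),
    "E": lambda a, b: (a[0]*b[0], a[1]*b[0]),
    "F": lambda a, b: (a[0]*b[0], 0.0),
}

def is_associative(mul, trials=200, tol=1e-9):
    """Check a(bc) = (ab)c, Eq. (7), on random pairs."""
    for _ in range(trials):
        a, b, c = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(3)]
        lhs, rhs = mul(a, mul(b, c)), mul(mul(a, b), c)
        if abs(lhs[0] - rhs[0]) > tol or abs(lhs[1] - rhs[1]) > tol:
            return False
    return True

assert all(is_associative(m) for m in RULES.values())
```

Each rule is bilinear by construction (it is an instance of (6)), so associativity is the only extra condition being exercised here.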

#### **4. Product Rule F**

It was inevitable that two-dimensional multiplication would include one-dimensional scalar multiplication as a special case, and what our derivation demonstrates is that there is necessary consistency between one and two dimensions. Both follow from the same basic symmetries of shuffling and sequencing. In one dimension, the scalar sum and product rules are those of probability, observed from this viewpoint as part of an intellectual structure common to both mathematics and physics.

Whether quantum or classical, physics makes predictions expressed as *likelihoods* Pr(outcome | setup), assuming what we think we know about the setup, expressed as the *prior*. Bayesian inference then uses actual outcomes to refine the predictions (the *posterior*) and assess the predictive quality of our assumptions (the *evidence*). Probability and quantum theory (which is basic physics) share a common foundation, and quantum behaviour fits seamlessly into Bayesian analysis no differently to anything else. There can be no quantum weirdness in this approach. It is all only ordinary inference.

Note our bottom-up strategy. Our assumptions refer to basic symmetries of only two or three items at a time. We start with 1 + 1 = 2 items and build up from there. This is opposite to approaches of superficially greater sophistication which specify top-down from infinity, ∑<sub>1</sub><sup>∞</sup>(·). That strategy is deliberate. Infinity and the continuum involve delicate limits that might not hold in all circumstances. We think it perversely fragile to traverse delicate analysis in order to return, ultimately, to discrete predictions which never needed the delicacy in the first place.

#### **5. Product Rule A**

Product rule A is complex multiplication, which we can write in operator form as follows.

$$a \cdot = \begin{pmatrix} a\_1 & -a\_2 \\ a\_2 & a\_1 \end{pmatrix} = r e^{i\theta} \tag{10}$$

The condition *<sup>r</sup>* <sup>=</sup> 1 selects operators which under repeated application (*<sup>a</sup>* · )*<sup>n</sup>* cause neither divergence towards infinity (*r* > 1) nor collapse towards zero (*r* < 1). Thus, *r* = 1 defines *unit quantity*. Knowledge of quantity is invariant under such unit-modulus operations, which add constants to the phase of the operand while leaving its modulus unchanged. Correspondingly, our knowledge Pr(*φ*) of the phase of an operand is invariant to offset, obeying Pr(*φ*) = Pr(*φ* + *θ*) for any *θ*, indicating uniformity.

$$\Pr(\phi) = \frac{1}{2\pi} = \text{uniform over } \phi \in [0, 2\pi) \tag{11}$$

Later, it will be observed that neither of the other pair–pair product rules can support a proper prior distribution, and *this is why quantum theory uses complex numbers*. It is the only way of enabling consistency with probability. Uncertainty being identified with phase, the modulus becomes associated with quantity, and the unit modulus represents unit quantity.

Incidentally, phase has to be a continuous variable. If it was not, any gap could be filled in by adding complex numbers from either side: one from the upper boundary and one from the lower in order to obtain a sum with intermediate phase. Whilst absolute phase of any newly introduced quantity is unknown (and uniformly distributed), *relative* phase is the critical ingredient that distinguishes representation of quantity-with-uncertainty from representation of quantity alone. Relative phase manifests as interference.

#### **6. Born Rule**

When we combine objects, the scalar sum rule demands that we add quantity. That is the definition: quantity is what adds up. So what scalar property of complex numbers could be additive? Complex moduli do not add up, either directly (|*a* + *b*| ≠ |*a*| + |*b*| in general) or through any function (*f*(|*a* + *b*|) ≠ *f*(|*a*|) + *f*(|*b*|)). Phase differences interfere and appear to force an inconsistency with scalar addition. However, we do not know the phases of independent objects; thus, we do not actually know what the interference should be. Fortunately, the rules of probability instruct us to average ("marginalise") over what we do not know in order to arrive at the reproducible behaviour that we seek. Moreover, we *can* attain consistency on average, provided we use *f*(*x*) = *x*<sup>2</sup>. That modulus-squared form:

$$\left\langle \left| a e^{i\alpha} + b e^{i\beta} \right|^2 \right\rangle\_{\alpha,\beta} = \left\langle \left| a e^{i\alpha} \right|^2 \right\rangle\_{\alpha} + \left\langle \left| b e^{i\beta} \right|^2 \right\rangle\_{\beta} = |a|^2 + |b|^2 \tag{12}$$

is forced upon us and yields a general additivity that holds for any number of components. The average-modulus-squared relationship is the archetype of the *Born rule* of quantum theory [9]. Updating our nomenclature of complex numbers to the more traditional *ψ*, we have the additive scalar quantification |*ψ*|<sup>2</sup>, which we interpret as follows.

$$Q = \left\langle |\psi|^2 \right\rangle = \text{ rate of supply of } \psi = \text{ intensity.} \tag{13}$$

This is how uncertainty manifests in the formalism. The observably additive intensity emitted by sources and observed by receivers is a modulus-squared *average*. Averaging can be performed artificially by picturing an ensemble of possibilities, as expressed either algebraically with a formula for Pr(·) or arithmetically as Monte Carlo samples. In the laboratory, averaging is performed by repetition. Thus, *the fundamental object of inquiry is a stream* and not, as commonly assumed, a particle. A particle is less fundamental because it carries the extra information that a detector had fired. Streams add up. Particles only add up if they are independent because otherwise phase difference is liable to cause interference so that 1 -with- 1 could become anything from 0 to 4.
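The averaging in (12) is easy to reproduce as a small Monte Carlo sketch (the function name is illustrative, not from the paper): with both phases unknown and uniform, the interference term averages away.

```python
import math, cmath, random

def avg_intensity(a, b, n=200_000):
    """Monte Carlo estimate of <|a e^{i alpha} + b e^{i beta}|^2> over uniform phases."""
    total = 0.0
    for _ in range(n):
        alpha = random.uniform(0.0, 2 * math.pi)
        beta = random.uniform(0.0, 2 * math.pi)
        total += abs(a * cmath.exp(1j * alpha) + b * cmath.exp(1j * beta)) ** 2
    return total / n

a, b = 1.0, 2.0
mean = avg_intensity(a, b)
# Eq. (12): the average is |a|^2 + |b|^2 = 5, up to Monte Carlo noise,
# even though a single sample ranges over |a|^2 + |b|^2 +/- 2|a||b|.
assert abs(mean - (abs(a) ** 2 + abs(b) ** 2)) < 0.1
```

For *a* = *b* = 1, individual samples span the interference range 0 to 4 mentioned in the text, while the average settles at 2.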

Observing *Q* yields only partial information about *ψ*, because the phase remains unknown and the observation is only an average. The probability distribution for *ψ* under this quadratic constraint *Q* is assigned by maximum entropy as the Gaussian:

$$\Pr(\psi \mid Q) = \frac{\exp(-|\psi|^2/Q)}{\pi Q} \tag{14}$$

just as standard in classical inference and signal processing.

The foundations of quantum theory begin to become apparent, but so far all we have quantified is existence. We need more. We need properties.

#### **7. Product Rule B**

Product rule B is hyperbolic (or "split-complex") multiplication, written in operator form as follows.

$$a \cdot = \begin{pmatrix} a\_1 & a\_2 \\ a\_2 & a\_1 \end{pmatrix} \qquad \text{with} \qquad r = \sqrt{|\det(a \cdot)|} \tag{15}$$

The condition *r* = 1 selects operators which under repeated application (*a* ·)<sup>*n*</sup> cause neither divergence towards infinity (*r* > 1) nor collapse towards zero (*r* < 1). Thus, *r* = 1 defines *unit quantity*. Instead of complex phase arctan(*a*2/*a*1), we have pseudo-phase *φ* = arctanh(*a*2/*a*1). Offsets of *φ* leave *r* unchanged, but the range is unbounded, *φ* ∈ (−∞, ∞). Hence, as anticipated, we could not use *φ* to express uncertainty, because the corresponding uniform prior would be improper. Rule B does not admit an alternative representation of quantity and uncertainty.
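A minimal numerical sketch of the rule-B operator (15), with helper names of our own: *r* = √|det| is multiplicative (so *r* = 1 is preserved under chaining), and pseudo-phases add under composition while leaving *r* unchanged.

```python
import math

def b_matrix(a1, a2):
    """Operator form of rule B, Eq. (15)."""
    return [[a1, a2], [a2, a1]]

def matmul(m, n):
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def r_of(m):
    """Unit-quantity measure r = sqrt(|det|)."""
    return math.sqrt(abs(m[0][0] * m[1][1] - m[0][1] * m[1][0]))

def pseudo_phase(m):
    """phi = artanh(a2 / a1)."""
    return math.atanh(m[1][0] / m[0][0])

# Two unit-quantity operators (r = 1), pseudo-phases 0.7 and -1.3:
u = b_matrix(math.cosh(0.7), math.sinh(0.7))
v = b_matrix(math.cosh(-1.3), math.sinh(-1.3))
w = matmul(u, v)

assert abs(r_of(u) - 1.0) < 1e-12 and abs(r_of(w) - 1.0) < 1e-12
assert abs(pseudo_phase(w) - (0.7 - 1.3)) < 1e-12  # pseudo-phases add
```

Because the pseudo-phase range is the whole real line, a uniform prior over it cannot be normalised, which is the impropriety noted above.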

Instead, we upgrade our inquiry to binary this-or-that *properties*, each of which can exist or not and is represented by a quantity-and-uncertainty pair. We call these binaries:

$$
\psi = \begin{pmatrix} \psi\_\uparrow \\ \psi\_\downarrow \end{pmatrix} = \begin{pmatrix} \psi\_0 + i\psi\_1 \\ \psi\_2 + i\psi\_3 \end{pmatrix} \tag{16}
$$

*qubits*. The detection of |*ψ*↑|<sup>2</sup> by an ↑-detector would signify the supply of ↑, and similarly for ↓, so that |*ψ*↑|<sup>2</sup> + |*ψ*↓|<sup>2</sup> would signify the presence of qubits with either property.

Shuffling and sequencing are still to be obeyed, and the ABCDEF rules still apply in the complex field. So far, we are invoking rules A and B. For the sake of simplicity, we restrict attention to unit operators and write them in generator form. The generator form

$$a \cdot = \exp(\phi \mathbf{G}) = \lim\_{n \to \infty} \left( \mathbf{1} + \frac{\mathbf{G}}{n} \right)^{n\phi} \tag{17}$$

allows us to separate the (complex) magnitude *φ* of a multiplication (*a* ·) from its structure *G*. From (9), the unit operators for A and B are as follows:

$$
\begin{array}{ll}
\begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix} = \exp(\phi\mathbf{A}) & \text{where} \quad \mathbf{A} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \\[2ex]
\begin{pmatrix} \cosh\phi & \sinh\phi \\ \sinh\phi & \cosh\phi \end{pmatrix} = \exp(\phi\mathbf{B}) & \text{where} \quad \mathbf{B} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\end{array} \tag{18}
$$

where we recognise the Hermitian *Pauli matrices:*

$$
\sigma\_0 = \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}\_{\mathbf{1}}, \quad \sigma\_x = \underbrace{\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}}\_{\mathbf{B}}, \quad \sigma\_y = \underbrace{\begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}}\_{i\mathbf{A}}, \quad \sigma\_z = \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}}\_{\mathbf{B}\mathbf{A}} \tag{19}
$$

that mix trigonometric and hyperbolic rotations in complex context.
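These identifications can be confirmed numerically. The sketch below (with our own plain power-series `expm` helper, and with *σ<sub>y</sub>* reconstructed as *i***A**) checks that exp(*φ***A**) and exp(*φ***B**) reproduce the trigonometric and hyperbolic rotations of (18), and that **B**, *i***A** and **BA** are the Pauli matrices of (19):

```python
import numpy as np

A = np.array([[0., -1.], [1., 0.]])
B = np.array([[0., 1.], [1., 0.]])

def expm(M, terms=30):
    """Plain power-series matrix exponential (adequate for these 2x2 matrices)."""
    out, P = np.eye(2), np.eye(2)
    for k in range(1, terms):
        P = P @ M / k
        out = out + P
    return out

phi = 0.4
rot = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])
hyp = np.array([[np.cosh(phi), np.sinh(phi)], [np.sinh(phi), np.cosh(phi)]])
assert np.allclose(expm(phi * A), rot)   # Eq. (18), rule A
assert np.allclose(expm(phi * B), hyp)   # Eq. (18), rule B

sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
sigma_y = np.array([[0, -1j], [1j, 0]])
sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)
assert np.allclose(B, sigma_x)       # B  = sigma_x
assert np.allclose(1j * A, sigma_y)  # iA = sigma_y
assert np.allclose(B @ A, sigma_z)   # BA = sigma_z
```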

#### *7.1. Quantification*

For quantification, we still have the Born rule, initially as individual averages of |*ψ*↑|<sup>2</sup> and |*ψ*↓|<sup>2</sup>, but quickly promoted to *ψ*<sup>†</sup>**H***ψ* for any Hermitian **H**. Specifically, the Pauli matrices yield scalar observables.

$$\begin{aligned} p\_0 &= \langle \psi^\dagger \sigma\_0 \,\psi \rangle = \langle \psi\_0^2 + \psi\_1^2 + \psi\_2^2 + \psi\_3^2 \rangle \\ p\_x &= \langle \psi^\dagger \sigma\_x \,\psi \rangle = 2 \langle \psi\_0 \psi\_2 + \psi\_1 \psi\_3 \rangle \\ p\_y &= \langle \psi^\dagger \sigma\_y \,\psi \rangle = 2 \langle \psi\_0 \psi\_3 - \psi\_1 \psi\_2 \rangle \\ p\_z &= \langle \psi^\dagger \sigma\_z \,\psi \rangle = \langle \psi\_0^2 + \psi\_1^2 - \psi\_2^2 - \psi\_3^2 \rangle \end{aligned} \tag{20}$$
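The component formulas in (20) can be checked directly for a single sample *ψ*, before any averaging (illustrative code, not from the paper):

```python
import numpy as np

# psi = (psi0 + i psi1, psi2 + i psi3), as in Eq. (16).
rng = np.random.default_rng(0)
psi0, psi1, psi2, psi3 = rng.normal(size=4)
psi = np.array([psi0 + 1j * psi1, psi2 + 1j * psi3])

sigma = {
    "0": np.eye(2, dtype=complex),
    "x": np.array([[0, 1], [1, 0]], dtype=complex),
    "y": np.array([[0, -1j], [1j, 0]]),
    "z": np.array([[1, 0], [0, -1]], dtype=complex),
}
# p_mu = psi^dagger sigma_mu psi (real for Hermitian sigma_mu)
p = {k: (psi.conj() @ s @ psi).real for k, s in sigma.items()}

assert np.isclose(p["0"], psi0**2 + psi1**2 + psi2**2 + psi3**2)
assert np.isclose(p["x"], 2 * (psi0 * psi2 + psi1 * psi3))
assert np.isclose(p["y"], 2 * (psi0 * psi3 - psi1 * psi2))
assert np.isclose(p["z"], psi0**2 + psi1**2 - psi2**2 - psi3**2)
```

For a single pure *ψ* the invariant of (27) evaluates to zero; a strictly positive *m*<sup>2</sup> needs the averaging over an ensemble described above.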

Of these, *p*0 observes qubits as a whole. Given a covariance matrix *ρ* = ⟨*ψψ*<sup>†</sup>⟩ derived from such observations (which in practice would often include statistical uncertainty too), the maximum entropy assignment of probability in any dimension is the Gaussian.

$$\Pr(\psi \mid \rho) = \frac{\exp(-\psi^\dagger \rho^{-1} \psi)}{\det(\pi \rho)} \tag{21}$$

This distribution of *ψ* can be used to predict observable quantities in the usual manner, no differently from any other application of probability. Knowing what a system is enables us to predict how it will behave, whether probabilistically or definitively.

It is clear from (21) that initial coordinates could be replaced by any unitary combination.

$$\Pr(\psi \mid \rho) = \frac{\exp\left(-(\mathbf{U}\psi)^{\dagger}(\mathbf{U}\rho\mathbf{U}^{\dagger})^{-1}(\mathbf{U}\psi)\right)}{\det(\pi\rho)}, \qquad (\mathbf{U}^{\dagger}\mathbf{U} = \mathbf{1})\tag{22}$$

Hence, there is no observational test of whether *ψ* or **U***ψ* is the fundamental representation; thus, our physics needs to be invariant under unitary transformation. This gives the complex representations of quantum theory flexibility beyond that accessible to classical scalars for which quantity is restricted to positive values.

#### *7.2. Quantisation*

We become aware of quantisation when a probe initialised in a fragile metastable state is brought to the target stream (Figure 1(right)). If the target triggers effectively irreversible descent of the probe into a disordered state (cleverly engineered to be a macroscopic pointer angle or suchlike), then that becomes available as an essentially permanent macroscopic bit of information signifying that something (a quantum) had been present in the target stream at that time. It is from this point that the experimenter may in person or by proxy become aware of the digital event and incorporate it into probabilistic modelling.

Assuming that the interaction has been engineered to preserve the identities of target and probe, the something (the quantum) could then be intercepted from the ongoing stream and preserved for later use. Its total modulus would then be known (and conventionally assigned as 1), so that its "wave function" *ψ* would be represented by the distribution:

$$\Pr(\psi) \propto \delta(\psi^\dagger \psi - 1) \tag{23}$$

with known modulus and unknown phase. With the wave function *ψ* thereby normalised and confined to the Hilbert sphere *ψ*†*ψ* = 1, the covariance *ρ* is known as the *density matrix*. However, it may be noted that such confinement damages the smooth elegance of the Gaussian analysis of streams.

This is the source of entanglement, contextuality and such quintessentially quantum phenomena. It is simply Bayesian analysis in the unfamiliar context of complex numbers (as suggested by [10]) and in indirect modulus-squared observation, usually written in the Dirac bracket notation with *ψ* = |*ψ*⟩ and *ψ*<sup>†</sup> = ⟨*ψ*|, which was developed for physics [11,12] before the Bayesian paradigm became pre-eminent in inference.

For example, if *pz* were observed equal to *p*0, then the definitions (20) would force *ψ* to be purely *ψ*↑, an eigenvector of *σz* (with unknown phase) with covariance *ρ* = *ψ*↑*ψ*↑<sup>†</sup>. Subsequent prediction of *ψ* with (21) would ensure that *pz* remained equal to *p*0. Measurements of *px*, on the other hand, would average to zero upon repetition, and restriction of an individual quantum to the Hilbert sphere (23) would force individual outcomes to take one of the eigenvectors of *σx* at random, incidentally destroying the memory of the earlier *pz*. We see that "wave-function collapse" is just the incorporation of new data into ordinary Bayesian inference on the Hilbert sphere.

#### *7.3. Transformations*

The Pauli matrices generate the 6-parameter *Lorentz group* of unit operators.

$$\exp(\phi\_x \sigma\_x + \phi\_y \sigma\_y + \phi\_z \sigma\_z) \quad \text{with complex coefficients } \phi = -\frac{1}{2}(\zeta + i\eta) \tag{24}$$

On using *ηz* only, the Pauli observables transform as follows:

$$
\begin{pmatrix} p'\_0 \\ p'\_x \\ p'\_y \\ p'\_z \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos \eta & -\sin \eta & 0 \\ 0 & \sin \eta & \cos \eta & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} p\_0 \\ p\_x \\ p\_y \\ p\_z \end{pmatrix} \tag{25}
$$

while on using *ζz* only, they transform as follows.

$$
\begin{pmatrix} p'\_0 \\ p'\_x \\ p'\_y \\ p'\_z \end{pmatrix} = \begin{pmatrix} \cosh\zeta & 0 & 0 & -\sinh\zeta \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\sinh\zeta & 0 & 0 & \cosh\zeta \end{pmatrix} \begin{pmatrix} p\_0 \\ p\_x \\ p\_y \\ p\_z \end{pmatrix} \tag{26}
$$

In each case, we have the following:

$$p\_0^2 - p\_x^2 - p\_y^2 - p\_z^2 = m^2 > 0 \text{ is invariant} \tag{27}$$

and the axis could have been in any (*x*, *y*, *z*) direction. We recognise the rotation and boost behaviour of 4-momentum, but the identification would be premature because we do not yet have spacetime.
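The transformations (25) and (26), and the invariant (27), can be checked by applying the unit operators exp(−*iησ<sub>z</sub>*/2) and exp(−*ζσ<sub>z</sub>*/2) (conventions per (24); variable names our own) to *ψ* and recomputing the Pauli observables:

```python
import numpy as np

def pauli_obs(psi):
    """(p0, px, py, pz) of Eq. (20) for a single psi."""
    sx = np.array([[0, 1], [1, 0]], dtype=complex)
    sy = np.array([[0, -1j], [1j, 0]])
    sz = np.array([[1, 0], [0, -1]], dtype=complex)
    return np.array([(psi.conj() @ s @ psi).real
                     for s in (np.eye(2, dtype=complex), sx, sy, sz)])

rng = np.random.default_rng(1)
psi = rng.normal(size=2) + 1j * rng.normal(size=2)
p = pauli_obs(psi)

eta, zeta = 0.6, 0.9
rot = np.diag([np.exp(-1j * eta / 2), np.exp(1j * eta / 2)])  # rotation about z
boost = np.diag([np.exp(-zeta / 2), np.exp(zeta / 2)])        # boost along z

R = np.array([[1, 0, 0, 0],
              [0, np.cos(eta), -np.sin(eta), 0],
              [0, np.sin(eta),  np.cos(eta), 0],
              [0, 0, 0, 1]])                                   # Eq. (25)
L = np.array([[np.cosh(zeta), 0, 0, -np.sinh(zeta)],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [-np.sinh(zeta), 0, 0, np.cosh(zeta)]])          # Eq. (26)

assert np.allclose(pauli_obs(rot @ psi), R @ p)
assert np.allclose(pauli_obs(boost @ psi), L @ p)

inv = lambda q: q[0]**2 - q[1]**2 - q[2]**2 - q[3]**2          # Eq. (27)
assert np.isclose(inv(pauli_obs(rot @ psi)), inv(p))
assert np.isclose(inv(pauli_obs(boost @ psi)), inv(p))
```

(For a single pure *ψ* the invariant is zero, so both sides of the last checks vanish; the invariance itself holds regardless.)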

#### **8. Product Rule C**

Product rule C in operator form is as follows.

$$a \cdot = \begin{pmatrix} a\_1 & 0 \\ a\_2 & a\_1 \end{pmatrix} \tag{28}$$

As before, *<sup>r</sup>* <sup>=</sup> √| det(*<sup>a</sup>* · )<sup>|</sup> <sup>=</sup> 1 defines *unit quantity*. Instead of complex phase arctan(*a*2/*a*1), we have pseudo-phase *t* = *a*2/*a*1. Offsets of *t* leave *r* unchanged, but the range is unbounded *t* ∈ (−∞, ∞). Hence, as anticipated, we could not use *t* to express uncertainty because the uniform prior would be improper. The only choice really was rule A.

Unit operations now take the following form.

$$
\begin{pmatrix} 1 & 0 \\ t & 1 \end{pmatrix} = \exp(t\mathbf{C}) \quad \text{where} \quad \mathbf{C} = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \tag{29}
$$

This acts as an integrator:

$$
\begin{pmatrix} 1 & 0 \\ t & 1 \end{pmatrix} \begin{pmatrix} u \\ U \end{pmatrix} = \begin{pmatrix} u \\ U+ut \end{pmatrix} \tag{30}
$$

so that *U* is the integral of *u* with *dU* = *u dt*. Note that **C** operates in the real domain. It is not Hermitian, so it applies to scalar operands and not to complex entities such as *ψ*, and its coefficient *t* will be real. Rule C meaningfully applies to (real-valued) phase, rephasing *ψ* to *ψe*<sup>−*iθ*</sup> (the minus sign is conventional).
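The integrator behaviour of (29) and (30) is easy to verify (sketch with our own helper names); since **C**<sup>2</sup> = 0, the exponential series truncates after the linear term:

```python
import numpy as np

C = np.array([[0., 0.], [1., 0.]])  # generator of rule C, Eq. (29)

def expm(M, terms=20):
    """Plain power-series matrix exponential."""
    out, P = np.eye(2), np.eye(2)
    for k in range(1, terms):
        P = P @ M / k
        out = out + P
    return out

t1, t2 = 0.8, 1.7
# C @ C = 0, so exp(t C) = I + t C exactly:
assert np.allclose(expm(t1 * C), [[1, 0], [t1, 1]])

u, U = 3.0, 5.0
# Eq. (30): the operator accumulates dU = u dt ...
assert np.allclose(expm(t1 * C) @ [u, U], [u, U + u * t1])
# ... and composition adds the t's, as an integrator should:
assert np.allclose(expm(t1 * C) @ expm(t2 * C), expm((t1 + t2) * C))
```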

Take a qubit *ψ* with invariant *m*. Define *dτ* through *dθ* = *m dτ*, using *m* as a clock. Under the Pauli transformations, we have observed that (*p*0, *px*, *py*, *pz*) transforms as a 4-vector, covariant by convention, with invariant *m*. Phase *θ* is a 2*π*-periodic scalar that cannot be rescaled, because its 2*π*-periodicity represents *interference*, the distinguishing observational feature of quantum theory. Hence, (*dt*, *dx*, *dy*, *dz*) must transform as a contravariant 4-vector with invariant:

$$(d\tau)^2 = (dt)^2 - (dx)^2 - (dy)^2 - (dz)^2\tag{31}$$

obeying the following:

$$d\theta = \left(p\_0 dt - p\_x dx - p\_y dy - p\_z dz\right) / \hbar \tag{32}$$

where the arbitrary constant *ħ* (which is best set to one) defines our units of *t* and **x** relative to *m* and **p**. With *ψ* rephased by *e*<sup>−*iθ*</sup>, the Schrödinger equations:

$$i\hbar\frac{\partial\psi}{\partial t} = p\_0\psi \,, \quad i\hbar\frac{\partial\psi}{\partial x} = -p\_x\psi \,, \quad i\hbar\frac{\partial\psi}{\partial y} = -p\_y\psi \,, \quad i\hbar\frac{\partial\psi}{\partial z} = -p\_z\psi \tag{33}$$

are immediate.
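As a symbolic sketch (assuming sympy, and taking the plane-wave rephasing factor *e*<sup>−*iθ*</sup> with *θ* from (32)), all four equations of (33) can be verified directly:

```python
import sympy as sp

t, x, y, z, hbar = sp.symbols('t x y z hbar', positive=True)
p0, px, py, pz = sp.symbols('p0 px py pz', real=True)

# theta from Eq. (32); psi is the unit-modulus rephasing factor exp(-i theta)
theta = (p0 * t - px * x - py * y - pz * z) / hbar
psi = sp.exp(-sp.I * theta)

# The four Schroedinger equations of Eq. (33):
assert sp.simplify(sp.I * hbar * sp.diff(psi, t) - p0 * psi) == 0
assert sp.simplify(sp.I * hbar * sp.diff(psi, x) + px * psi) == 0
assert sp.simplify(sp.I * hbar * sp.diff(psi, y) + py * psi) == 0
assert sp.simplify(sp.I * hbar * sp.diff(psi, z) + pz * psi) == 0
```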

All that remains is to recognise the symbols: *p*0 as energy, (*px*, *py*, *pz*) as momentum, *t* as time and (*x*, *y*, *z*) as space. Moreover, we must identify the Lorentz coefficients *φ* = −(*ζ* + *iη*)/2 as the rotations and boosts of Minkowski spacetime.


#### *8.1. Viewpoint*

Traditionally, energy and momentum are obtained from the Schrödinger equations as differentials with respect to time and space. With ABC, it is the other way round—time and space emerge from quantum theory as integrals of energy and momentum.

$$(p\_0, p\_x, p\_y, p\_z) \overset{\text{ABC}}{\underset{\text{traditional}}{\rightleftharpoons}} (t, x, y, z)$$

**Spacetime is a construct of quantum, not a container for it.**

The ABC derivation dictates why space has three dimensions—and that is the right way round. Space and time are measured by theodolites and clocks, and observations are at root quantum in nature.

We have now developed a unified foundation based on the necessary incorporation of uncertainty with quantity.

Further development would include formalising qubit–qubit interactions (two Pauli matrices) by using the Dirac equation and continuing to quantum field theory. That would allow theoretical demonstration of idealised measurement probes and of how observation can be used both to predict future evolution and to retrodict past behaviour. However, since ABC recovers the standard equations, new insight would not be immediately expected.

#### **9. Speculations**

With a consistent formulation based on so few and general assumptions, it seems almost inconceivable that inconsistency could arise in future developments. Would a user wish to deny uncertainty, or shuffling, or sequencing?

However, there is more to be done. Although spacetime has been constructed with a locally Minkowski metric, inheriting continuity from complex phase, ABC does not require the manifold to be flat. We should presumably expect that curvature will follow physical matter, as general relativity indicates. Yet gravitational singularities appear to be inconsistent with the required reversibility of the quantum formalism.

One may ask for the source of that reversibility, for it is at least superficially plausible that dropping an object into a black hole is *not* a reversible process—at least not on a timescale of interest to the experimenter. Since it appears that the fundamental object is not a particle on its Hilbert sphere but a stream represented by a Gaussian distribution, it seems that there is scope for revisiting this question from a revised viewpoint.

Traditional developments of quantum theory and of geometry have accepted the notion of a division algebra [13,14] in which every operation is presumed to have an inverse. However, that may not be true. ABC does not need to assume inverses. They might fail in extreme situations. Assuming division algebras may have been an assumption too far.

One may also speculate on the roles of the unused product rules D and E. Traditionally, these scalar-pair product rules are imposed as part of the axiomatic structure of a vector space. Users are just expected to accept them on the basis of initial plausibility, familiarity and submission to conventional authority. However, ABC did not need to assume them. In the two-parameter quantity-and-uncertainty space, D and E appear as additional candidate rules. Each of A, B, C and F had critically important roles. Why would D and E not have critically important roles too?

Inquiry at this depth does not throw up results that we should casually discard. Rules D and E suggest (demand?) the existence of scalar stuff different from the pair-valued stuff of our original inquiry. We might only be able to interact with it through the curvature of space, in other words gravity. Is this dark matter? Might we have predicted this possibility ahead of time if we had thought this through 50 years ago?

**Author Contributions:** All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


### *Proceeding Paper* **A Weakly Informative Prior for Resonance Frequencies †**

**Marnix Van Soom \* and Bart de Boer**


**Abstract:** We derive a weakly informative prior for a set of ordered resonance frequencies from Jaynes' principle of maximum entropy. The prior facilitates model selection problems in which both the number and the values of the resonance frequencies are unknown. It encodes a weakly inductive bias, provides a reasonable density everywhere, is easily parametrizable, and is easy to sample. We hope that this prior can enable the use of robust evidence-based methods for a new class of problems, even in the presence of multiplets of arbitrary order.

**Keywords:** weakly informative prior; resonance frequency; model selection; maximum entropy

#### **1. Introduction**

An important problem in the natural sciences is the accurate measurement of resonance frequencies. The problem can be formalized by the following probabilistic model:

$$p(D, \mathbf{x}|I) = p(D|\mathbf{x})p(\mathbf{x}|I) \equiv \mathcal{L}(\mathbf{x})\pi(\mathbf{x}),\tag{1}$$

where *D* is the data, *x* = {*x*<sub>1</sub>, . . . , *x<sub>K</sub>*} are the *K* resonance frequencies of interest, and *I* is the prior information about *x*. As an example instance of (1), we refer to the vocal tract resonance (VTR) problem discussed in Section 5, for which *D* is audio recorded from the mouth of a speaker, *x* is a set of *K* VTR frequencies, and the underlying model is a sinusoidal regression model. Any realistic problem will include additional model parameters *θ*, but these have been silently ignored by formally integrating them out of (1), i.e., *p*(*D*, *x*|*I*) = ∫ d*θ* *p*(*D*, *x*, *θ*|*I*).
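As a toy, discretized illustration of (1) and of the formal marginalization over *θ* (a grid model of our own devising, not the VTR model of Section 5; the Gaussian likelihood and grids are hypothetical):

```python
import numpy as np

x_grid = np.linspace(100.0, 1000.0, 91)   # candidate resonance frequency (Hz)
theta_grid = np.linspace(0.5, 2.0, 31)    # hypothetical nuisance parameter

prior_x = np.ones_like(x_grid) / x_grid.size              # pi(x): flat toy prior
prior_theta = np.ones_like(theta_grid) / theta_grid.size  # p(theta|I): flat toy prior

def likelihood(x, theta, x_true=440.0):
    """Hypothetical stand-in for p(D | x, theta), peaked at the true frequency."""
    return np.exp(-0.5 * ((x - x_true) / (20.0 * theta)) ** 2)

# p(D, x, theta | I) = L(x, theta) * pi(x) * p(theta | I) on the grid:
joint = (likelihood(x_grid[:, None], theta_grid[None, :])
         * prior_x[:, None] * prior_theta[None, :])
# Marginalize the nuisance: p(D, x | I) = sum over theta
p_D_x = joint.sum(axis=1)
# Normalize to the posterior p(x | D, I):
posterior = p_D_x / p_D_x.sum()

assert np.isclose(x_grid[np.argmax(posterior)], 440.0)
```

The point of the sketch is only the factorization and marginalization pattern; a realistic treatment would use nested sampling or reversible jump MCMC, as noted above.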

In this paper, we assume that the likelihood $\mathcal{L}(\mathbf{x}) \equiv p(D|\mathbf{x})$ is given, and our task is to choose an *uninformative* prior $\pi(\mathbf{x}) \equiv p(\mathbf{x}|I)$ from *limited* prior information *I*. A conflict arises, however:

The uninformative priors *π* most commonly chosen to express limited prior information *I* are, in practice, often precluded by that same *I*. (2)

The goal of this paper is to describe this conflict (2) and to show how it can be resolved by adopting a specific choice for *π*. This allows robust inference of the number of resonances *K* in the important case of such limited prior information *I*, which in turn enables accurate measurement of the resonance frequencies *x* with standard methods such as nested sampling [1] or reversible jump MCMC [2].

#### **2. Notation**

The symbol *π* is intended to convey a vague notion of a generally uninformative or weakly informative prior. Definite *choices* for *π* are indicated with the subscript *i*:

$$\pi\_i(\mathbf{x}) \equiv p(\mathbf{x}|\beta\_i, I\_i) \qquad (i = 1, 2, 3),\tag{3}$$

**Citation:** Van Soom, M.; de Boer, B. A Weakly Informative Prior for Resonance Frequencies. *Phys. Sci. Forum* **2021**, *3*, 2. https://doi.org/10.3390/psf2021003002

Academic Editors: Wolfgang von der Linden and Sascha Ranftl

Published: 4 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

where $\beta\_i$ is a placeholder for the hyperparameter specific to $\pi\_i$. Note that in the plots below and for the experiments in Section 5, the values of the $\beta\_i$ are always set according to Table 1.

**Table 1.** The values of the hyperparameters *β<sup>i</sup>* used throughout the paper. All quantities are given in units of Hz.


Each $\pi\_i$ uniquely determines a number of important high-level quantities, since the likelihood $\mathcal{L}(\mathbf{x})$ and data *D* are assumed to be given. These quantities are the *evidence* for the model with *K* resonances:

$$Z\_i(K) = \int \mathrm{d}^K \mathbf{x} \, \mathcal{L}(\mathbf{x}) \, \pi\_i(\mathbf{x}),\tag{4}$$

the *posterior*:

$$P\_i(\mathbf{x}) = \frac{\mathcal{L}(\mathbf{x})\pi\_i(\mathbf{x})}{Z\_i(K)},\tag{5}$$

and the *information*:

$$H\_i(K) = \int \mathrm{d}^K\mathbf{x}\, P\_i(\mathbf{x}) \log \frac{P\_i(\mathbf{x})}{\pi\_i(\mathbf{x})},\tag{6}$$

which measures the amount of information obtained by updating from prior $\pi\_i$ to posterior $P\_i$, i.e., $H\_i(K) \equiv D\_{\mathrm{KL}}(P\_i|\pi\_i)$, where $D\_{\mathrm{KL}}$ is the Kullback–Leibler divergence.
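The information (6) can be estimated cheaply from the by-products of a sampling run: since $P\_i = \mathcal{L}\pi\_i/Z\_i$, Eq. (6) reduces to $H\_i(K) = \langle \log \mathcal{L} \rangle\_{P\_i} - \log Z\_i(K)$, so posterior log-likelihood values and the log-evidence suffice. A minimal sketch (the function name and toy numbers are ours, not from the paper):

```python
import numpy as np

def information_nats(loglike_posterior, logZ):
    """H(K) = D_KL(P|pi) = <log L>_P - log Z, which follows from Eqs. (4)-(6).

    loglike_posterior: log-likelihood values of samples drawn from the posterior.
    logZ: log-evidence of the model (e.g., from a nested sampling run).
    """
    return np.mean(loglike_posterior) - logZ

# toy numbers: posterior log-likelihoods around 35 nats, log Z = 5 nats
H = information_nats([34.0, 35.0, 36.0], 5.0)  # -> 30.0 nats
```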

#### **3. Conflict**

The *uninformative priors π* referenced in (2) are of the independent and identically distributed type:

$$\pi(\mathbf{x}) = \prod\_{k=1}^{K} g(x\_k|\beta),\tag{7}$$

where *g*(*x*|*β*) is any wide distribution with hyperparameters *β*. A typical choice for *g* is the uniform distribution over the full frequency bandwidth; other examples include diffuse Gaussians or Jeffreys priors [3–9].
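For concreteness, an i.i.d. prior of type (7) with the Jeffreys choice for *g* can be sampled by drawing log-frequencies uniformly; the bounds and sample sizes here are illustrative values of our own, not those of Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_iid_jeffreys(a, b, K, n):
    """Draw n samples from the i.i.d. prior (7) with g = Jeffreys on [a, b]:
    h(x|a,b) is proportional to 1/x, i.e., log x is uniform on [log a, log b]."""
    return np.exp(rng.uniform(np.log(a), np.log(b), size=(n, K)))

x = sample_iid_jeffreys(100.0, 5000.0, K=3, n=10_000)
```

Because the components are exchangeable, nothing in these draws prefers the ordered region, which is precisely what leads to the label switching problem below.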

Second, the *limited prior information I* in (2) about *K* implies that the problem will involve model selection, since each value of *K* implicitly corresponds to a different model for the data. It is, thus, necessary to evaluate and compare the evidence $Z(K) = \int \mathrm{d}^K x\, \mathcal{L}(\mathbf{x})\pi(\mathbf{x})$ for each plausible *K*.

The conflict between these two elements is due to the *label switching problem*, which is a well-known issue in mixture modeling, e.g., [10]. The likelihood functions $\mathcal{L}(\mathbf{x})$ used in models parametrized by resonance frequencies are typically invariant to switching the labels *k*; i.e., the index *k* of the frequency $x\_k$ has no distinguishable meaning in the model underlying the data. The posterior $P(\mathbf{x}) \propto \mathcal{L}(\mathbf{x})\pi(\mathbf{x})$ will inherit this exchange symmetry if the prior is of type (7). Thus, if the model parameters *x* are well determined by the data *D*, the posterior landscape will consist of one *primary* mode, which is defined as a mode living in the *ordered region*:

$$\mathcal{R}\_K(x\_0) = \{ \mathbf{x} \,|\, x\_0 \le x\_1 \le x\_2 \le \cdots \le x\_K \} \quad \text{with} \quad x\_0 > 0,\tag{8}$$

and $(K! - 1)$ *induced* modes, which are identical to the primary mode up to a permutation of the labels *k* and, thus, live outside of the region $\mathcal{R}\_K(x\_0)$. The trouble is that correctly taking into account these induced modes during the evaluation of *Z*(*K*) requires a surprising amount of extra work in addition to tuning the MCMC method of choice; that is the label switching problem in our setting. In fact, there is currently no widely accepted solution for the label switching problem in the context of mixture models either [11,12]. This is, then, how in (2) uninformative priors *π* are "precluded" by the limited information *I*: the latter implies model selection, which in turn implies evaluating *Z*(*K*), which is hampered by the label switching problem due to the exchange symmetry of the former. Therefore, it seems better to avoid the problem altogether by encoding our preference for primary modes directly into the prior. This means abandoning the uninformative prior *π* in favor of the weakly informative prior $\pi\_3$, which is proposed in Section 4 as a solution to the conflict.

We use the VTR problem to briefly illustrate the label switching problem in Figure 1. The likelihood $\mathcal{L}(\mathbf{x})$ is described implicitly in Section 5 and is invariant to switching the labels *k* because the underlying model function (23) of the regression model is essentially a sum of sinusoids, one for each $x\_k$. As frequencies can be profitably thought of as scale variables ([13], Appendix A), the uninformative prior (7) is represented by

$$\pi\_1(\mathbf{x}) \equiv p(\mathbf{x}|x\_0, x\_{\text{max}}, I\_1) = \prod\_{k=1}^{K} h(x\_k|x\_0, x\_{\text{max}}),\tag{9}$$

where $\beta\_1 \equiv (x\_0, x\_{\text{max}})$ are a common lower and upper bound, and

$$h(x|a,b) = \begin{cases} \frac{1}{\log(b/a)} \frac{1}{x} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \quad \text{with} \quad \begin{array}{l} a > 0 \\ b < \infty \end{array}\tag{10}$$

is the Jeffreys prior, the conventional uninformative prior for a scale variable (although any prior of the form (7) that is sufficiently uninformative would yield essentially the same results). We have visualized the posterior landscape $P\_1(\mathbf{x})$ in Figure 1 by using the pairwise marginal posteriors $P\_1(x\_k, x\_\ell)$ plotted in blue. Note the exchange symmetry of $P\_1$, which manifests as an (imperfect) reflection symmetry around the dotted diagonal $x\_k = x\_\ell$ bordering the ordered region $\mathcal{R}\_3(x\_0)$. The primary mode can be identified by the black dot; all other modes are induced modes. Integrating all *K*! modes to obtain *Z*(*K*) quickly becomes intractable for $K \gtrsim 4$.

**Figure 1.** The exchange symmetry of the posterior $P\_1(\mathbf{x})$ for a well-determined instance of the VTR problem from Section 5 with *K* := 3. The pairwise marginal posteriors $P\_1(x\_k, x\_\ell)$ are shown using the isocontours of kernel density approximations calculated from posterior samples of *x*. For each panel, the diagonal $x\_k = x\_\ell$ is plotted as a dotted line, and the ordered region $\mathcal{R}\_3(x\_0)$ is shaded in grey. The black dot marks the mean of the primary mode for this problem.

*A Simple Way Out?*

A simple method out of the conflict is to break the exchange symmetry by assuming specialized bounds for each *xk*:

$$\pi\_2(\mathbf{x}) \equiv p(\mathbf{x}|a, b, I\_2) = \prod\_{k=1}^{K} h(x\_k | a\_k, b\_k),\tag{11}$$

where $\beta\_2 \equiv (a, b)$ with $a = \{a\_k\}\_{k=1}^{K}$ and $b = \{b\_k\}\_{k=1}^{K}$ being hyperparameters specifying the individual bounds. However, in order to enable the model to detect doublets (a resolved pair of two close frequencies, such as the primary mode in the leftmost panel in Figure 1), it is necessary to assign overlapping bounds in (*a*, *b*), presumably by using some heuristic. The necessary degree of overlap increases as the detection of higher-order multiplets such as triplets (which can and do occur) is desired, but the more overlap in (*a*, *b*), the more the label switching problem returns. Despite this issue, there will be cases where we have sufficient prior information *I* to set the (*a*, *b*) hyperparameters without too much trouble; the VTR problem is such a case, for which the overlapping values of (*a*, *b*) up to *K* = 5 are given in Table 1.

#### **4. Solution**

Our solution to the conflict (2) is a chain of *K* coupled Pareto distributions:

$$\pi\_3(\mathbf{x}) \equiv p(\mathbf{x}|\overline{\mathbf{x}\_0}, I\_3) = \prod\_{k=1}^{K} \mathrm{Pareto}(x\_k|x\_{k-1}, \lambda\_k),\tag{12}$$

where

$$\mathsf{Pareto}(\mathsf{x}|\mathsf{x}\_{\*},\lambda) = \begin{cases} \frac{\lambda x\_{\*}^{\lambda}}{x^{\lambda+1}} & \text{if } \mathsf{x} \ge \mathsf{x}\_{\*} \\ 0 & \text{otherwise} \end{cases} \quad \text{with} \quad \begin{array}{l} \mathsf{x}\_{\*} > 0 \\ \lambda > 0, \end{array} \tag{13}$$

and the hyperparameter $\beta\_3 \equiv \overline{\mathbf{x}\_0}$ is defined as

$$\overline{\mathbf{x}\_0} \equiv (\overline{x}\_0, \overline{\mathbf{x}}), \quad \overline{x}\_0 := x\_0, \quad \overline{\mathbf{x}} = \{\overline{x}\_k\}\_{k=1}^{K}, \quad \lambda\_k = \frac{\overline{x}\_k}{\overline{x}\_k - \overline{x}\_{k-1}}.\tag{14}$$

From Figure 2, it can be seen that $\pi\_3$ encodes weakly informative knowledge about *K ordered frequencies*: (12) and (13) together imply that $\pi\_3(\mathbf{x})$ is nonzero only for $\mathbf{x} \in \mathcal{R}\_K(x\_0)$. In other words, its support is precisely the ordered region $\mathcal{R}\_K(x\_0)$, which automatically solves the label switching problem underlying the conflict, as the exchange symmetry of *π* is broken. This is illustrated in Figure 2, where $P\_3$ contracts to a single primary mode, which is just what we would like.

The *K* + 1 hyperparameters $\overline{\mathbf{x}\_0}$ in (14) are a *common lower bound* $x\_0$ plus the *K expected values of the resonance frequencies* $\overline{\mathbf{x}}$. While the former is generally easily determined, the latter may seem difficult to set given the premise of this paper that we dispose only of limited prior information *I*. Why do we claim that $\pi\_3$ is only weakly informative if it is parametrized by the expected values of the very things it is supposed to be only weakly informative about? The answer is that, for any reasonable amount of data, inference based on $\pi\_3$ is completely insensitive to the exact values of $\overline{\mathbf{x}}$. Therefore, any reasonable guess for $\overline{\mathbf{x}\_0}$ will suffice in practice. For example, for the VTR problem, we simply applied a heuristic where we take $\overline{x}\_k = k \times 500$ Hz (see Table 1). This insensitivity is due to the maximum entropy status of $\pi\_3$ and indicates the weak inductive bias it entails. On a more prosaic level, the heavy tails of the Pareto distributions in (12) ensure that the prior will eventually be overwhelmed by the data, no matter how a priori improbable the true value of *x* is. More prosaically still, in Section 5.1 below we show quantitatively that, for the VTR problem, $\pi\_3$ is about as (un)informative as $\pi\_2$.

**Figure 2.** Contraction of the prior ($\pi\_3$) to the posterior ($P\_3$) for the application of $\pi\_3$ to the VTR problem used in Figure 1. The pairwise marginal prior $\pi\_3(x\_k, x\_\ell)$ is obtained by integrating out the third frequency; for example, $\pi\_3(x\_1, x\_2) = \int \mathrm{d}x\_3\, \pi\_3(\mathbf{x})$. Unlike $P\_1$ in Figure 1, $P\_3$ exhibits only a single mode, which coincides with the primary mode as marked by the black dot.

#### *4.1. Derivation of π*<sup>3</sup>

Our ansatz consists of interpreting the *x* as a set of *K ordered* scale variables that are bounded from below by *x*0. Starting from (9) and not bothering with the bounds (*a*, *b*), we obtain the improper pdf

$$m(\mathbf{x}) \propto \begin{cases} \prod\_{k=1}^{K} \frac{1}{x\_k} & \mathbf{x} \in \mathcal{R}\_K(\mathbf{x}\_0) \\ 0 & \text{otherwise.} \end{cases} \tag{15}$$

We can simplify (15) using the one-to-one transformation $\mathbf{x} \leftrightarrow \mathbf{u}$ defined as

$$\begin{aligned} \mathbf{x} \to \mathbf{u}: \quad u\_k &:= \log \frac{x\_k}{x\_{k-1}} \qquad & (k = 1, 2, \dots, K) \\ \mathbf{u} \to \mathbf{x}: \quad x\_k &:= x\_0 \exp \sum\_{\kappa=1}^{k} u\_\kappa \qquad & (k = 1, 2, \dots, K), \end{aligned}\tag{16}$$
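The transformation (16) can be transcribed directly; the helper names and numbers below are ours. Note that ordered *x* maps to componentwise nonnegative *u*, which is what makes (17) uniform on the positive orthant:

```python
import numpy as np

def x_to_u(x, x0):
    """Forward map in (16): u_k = log(x_k / x_{k-1}), with x_0 given."""
    xx = np.concatenate(([x0], x))
    return np.log(xx[1:] / xx[:-1])

def u_to_x(u, x0):
    """Inverse map in (16): x_k = x0 * exp(u_1 + ... + u_k)."""
    return x0 * np.exp(np.cumsum(u))

x = np.array([500.0, 1500.0, 2500.0])   # an ordered point in R_3(x0)
u = x_to_u(x, x0=100.0)                 # all entries nonnegative
assert np.allclose(u_to_x(u, x0=100.0), x)  # round trip recovers x
```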

which yields (with abuse of notation for brevity)

$$m(u) \propto \begin{cases} 1 & u \ge 0 \\ 0 & \text{otherwise.} \end{cases} \tag{17}$$

Since model selection requires proper priors, we need to normalize *m*(*u*) by adding extra information (i.e., constraints) to it; we propose to simply fix the *K* first moments $\overline{\mathbf{u}} = \{\overline{u}\_k\}\_{k=1}^{K}$. This will yield the Pareto chain prior $\pi\_3(\mathbf{u})$ directly, expressed in *u* space rather than *x* space. The expression for $\pi\_3(\mathbf{u})$ is found by minimizing the Kullback–Leibler divergence [14]

$$D\_{\mathrm{KL}}(\pi\_3|m) = \int \mathrm{d}^K u\, \pi\_3(\mathbf{u}) \log \frac{\pi\_3(\mathbf{u})}{m(\mathbf{u})}, \quad \text{subject to} \quad \langle \mathbf{u} \rangle \equiv \int \mathrm{d}^K u\, \mathbf{u}\, \pi\_3(\mathbf{u}) = \overline{\mathbf{u}},\tag{18}$$

where $\overline{\mathbf{u}} = \{\overline{u}\_k\}\_{k=1}^{K}$ are the supplied first moments. This variational problem is equivalent to finding $\pi\_3(\mathbf{u})$ by means of Jaynes' *principle of maximum entropy* with *m*(*u*) serving as the invariant measure [15]. Since the exponential distribution $\mathrm{Exp}(x|\lambda)$ is the maximum entropy distribution for a random variable $x \ge 0$ with a fixed first moment $\langle x \rangle = 1/\lambda$, the solution to (18) is

$$\pi\_3(\mathbf{u}) = \prod\_{k=1}^{K} \mathrm{Exp}(u\_k|\lambda\_k),\tag{19}$$

where the rate hyperparameters $\lambda\_k = 1/\overline{u}\_k$ and

$$\mathrm{Exp}(x|\lambda) = \begin{cases} \lambda \exp\{-\lambda x\} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases} \quad \text{with} \quad \lambda > 0.\tag{20}$$

Transforming (19) to *x* space using (16) finally yields (12), but we still need to express $\lambda\_k$ in terms of $\overline{\mathbf{x}}$, since we might find it hard to pick reasonable values of $\overline{u}\_k = \langle \log(x\_k/x\_{k-1}) \rangle$ from limited prior information *I*. For this, we will need the identity

$$
\langle x\_k \rangle \equiv \int \mathrm{d}^K x\, x\_k\, \pi\_3(\mathbf{x}) = \frac{\lambda\_k}{\lambda\_k - 1} \langle x\_{k-1} \rangle \qquad (k = 1, 2, \dots, K). \tag{21}
$$

Constraining $\langle x\_k \rangle = \overline{x}\_k$ and solving for $\lambda\_k$, we obtain $\lambda\_k = \overline{x}\_k/(\overline{x}\_k - \overline{x}\_{k-1})$, in agreement with (14). Note that the existence of the first marginal moments $\langle x\_k \rangle$ requires that $\lambda\_k > 1$.

#### *4.2. Sampling from π*<sup>3</sup>

Sampling from $\pi\_3$ is trivial because of the independence of the $u\_k$ in *u* space (19). To produce a sample $\mathbf{x}' \sim \pi\_3(\mathbf{x})$ given the hyperparameter $\overline{\mathbf{x}\_0}$, compute the corresponding rate parameters $\{\lambda\_k\}\_{k=1}^{K}$ from (14), use them in (19) to obtain a sample $\mathbf{u}' \sim \pi\_3(\mathbf{u})$, and then obtain the desired $\mathbf{x}'$ from $\mathbf{u}'$ using the transformation (16).
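A minimal sketch of this recipe (the function name is ours; the heuristic $\overline{x}\_k = k \times 500$ Hz and the value of $x\_0$ are used purely for illustration):

```python
import numpy as np

def sample_pi3(xbar, x0, n, rng):
    """Draw n samples x' ~ pi_3(x) of Eq. (12).

    Steps: lambda_k from Eq. (14); u'_k ~ Exp(lambda_k) per Eq. (19);
    then u' -> x' via the inverse map in Eq. (16).
    """
    grid = np.concatenate(([x0], xbar))                  # (x0, xbar_1, ..., xbar_K)
    lam = grid[1:] / np.diff(grid)                       # Eq. (14): rate parameters
    u = rng.exponential(1.0 / lam, size=(n, len(xbar)))  # Eq. (19): numpy uses scale = 1/rate
    return x0 * np.exp(np.cumsum(u, axis=1))             # Eq. (16)

rng = np.random.default_rng(42)
xbar = 500.0 * np.arange(1, 4)               # heuristic expected values for K = 3
X = sample_pi3(xbar, x0=100.0, n=100_000, rng=rng)
# every sample lies in the ordered region R_K(x0) by construction
```

Since $u'\_k \ge 0$, each sampled vector is automatically ordered and bounded below by $x\_0$; no rejection step is needed.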

#### **5. Application: The VTR Problem**

We now present a relatively simple but realistic instance of the problem of measuring resonance frequencies, which will allow us to illustrate the above ideas. The VTR problem consists of measuring human vocal tract resonance (VTR) frequencies *x* for each of five representative vowel sounds taken from the CMU ARCTIC database [16]. The VTR frequencies *x* describe the *vocal tract transfer function* *T*(*x*) and are fundamental quantities in acoustic phonetics [17]. The five vowel sounds are recorded utterances of the first vowel in the words *W* = {shore, that, you, little, until}. In order to achieve high-quality VTR frequency estimates $\hat{\mathbf{x}}$, only the quasi-periodic steady-state part of the vowel sound is considered for the measurement. The data *D*, thus, consist of a string of highly correlated *pitch periods*. See Figure 3 for an illustration of these concepts.

**Figure 3.** The VTR problem for the case (*D* := until, *K* := 10). Left panel: the data *D*, i.e., the quasi-periodic steady-state part, consist of 3 highly correlated pitch periods. Right panel: inferred VTR frequency estimates $\{\hat{x}\_k\}\_{k=1}^{K}$ for *K* := 10 at 3 sigma. They describe the power spectral density of the vocal tract transfer function $|T(x)|^2$, represented here by 25 posterior samples and compared to the Fast Fourier Transform (FFT) of *D*. All $\hat{x}\_k$ are well resolved, and most have error bars too small to be seen on this scale.

The measurement itself is formalized as inference using the probabilistic model (1). The model assumed to underlie the data is the sinusoidal regression model introduced in [18]; due to limited space, we only describe it implicitly. The sinusoidal regression model assumes that each pitch period $d \in D$ can be modeled as

$$d\_t = f(t; \mathbf{A}, \boldsymbol{\alpha}, \mathbf{x}) + \sigma e\_t \quad \text{where} \quad e\_t \sim \mathcal{N}(0, 1) \qquad (t = 1, 2, \dots, T),\tag{22}$$

where $d = \{d\_t\}\_{t=1}^{T}$ is a time series consisting of *T* samples. The model function

$$f(t; \mathbf{A}, \boldsymbol{\alpha}, \mathbf{x}) = \sum\_{k=1}^{K} \left[ A\_k \cos(x\_k t) + A\_{K+k} \sin(x\_k t) \right] \exp\{-\alpha\_k t\} + \sum\_{\ell=1}^{L} A\_{2K+\ell}\, t^{\ell-1}\tag{23}$$

consists of a sinusoidal part (the first sum) and a polynomial trend correction (the second sum). Note the additional model parameters $\theta = \{\mathbf{A}, \boldsymbol{\alpha}, \sigma, L\}$. Formally, given the prior *p*(*θ*) ([18], Section 2.2), the marginal likelihood $\mathcal{L}(\mathbf{x})$ is then obtained as $\mathcal{L}(\mathbf{x}) = \int \mathrm{d}\theta\, \mathcal{L}(\mathbf{x}, \theta)\, p(\theta)$, where the complete likelihood $\mathcal{L}(\mathbf{x}, \theta)$ is implicitly given by (22) and (23). Practically, we just marginalize out *θ* from samples obtained from the complete problem $p(D, \mathbf{x}, \theta|I)$.
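A literal transcription of the model function (23) might look as follows; the frequencies are taken in radians per sample, and all numbers are invented for illustration:

```python
import numpy as np

def f(t, A, alpha, x, L):
    """Model function (23): K exponentially damped sinusoids plus a
    polynomial trend of degree L-1. A holds the 2K + L linear amplitudes."""
    t = np.asarray(t, dtype=float)
    K = len(x)
    out = np.zeros_like(t)
    for k in range(K):  # first sum: damped sinusoids
        out += (A[k] * np.cos(x[k] * t)
                + A[K + k] * np.sin(x[k] * t)) * np.exp(-alpha[k] * t)
    for l in range(1, L + 1):  # second sum: polynomial trend
        out += A[2 * K + l - 1] * t ** (l - 1)
    return out

# one damped sinusoid (K = 1) plus a constant trend (L = 1)
y = f(np.arange(5), A=[1.0, 0.0, 0.5], alpha=[0.1], x=[0.3], L=1)
```

Because *f* is linear in **A**, the amplitudes can be marginalized analytically, which is what makes the marginal likelihood $\mathcal{L}(\mathbf{x})$ of [18] tractable.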

For inference, the computational method of choice is nested sampling [1] using the dynesty library [19–23], which scales roughly as $\mathcal{O}(K^2)$ [24]. Since the VTR problem is quite simple ($H\_i(K) \sim 30$ nats), we only perform single nested sampling runs and take the obtained $\log Z\_i(K)$ and $H\_i(K)$ as point estimates. Full details on the experiments and data are available at https://github.com/mvsoom/frequency-prior.

#### *5.1. Experiment I: Comparing π*<sup>2</sup> *and π*<sup>3</sup>

In Experiment I, we perform a high-level comparison between $\pi\_2$ and $\pi\_3$ in terms of *evidence* (4) and *information* (6). The values of the hyperparameters used in the experiment are listed in Table 1. We did not include $\pi\_1$ in this comparison, as the label switching problem prevented convergence of nested sampling runs for $K \ge 4$. The (*a*, *b*) bounds for $\pi\_2$ were based on loosely interpreting the VTRs as formants and consulting formant tables from standard works [25–30]. These allowed us to compile bounds up until the fifth formant, such that $K\_{\text{max}} = 5$. For $\pi\_3$, we simply applied a heuristic where we take $\overline{x}\_k = k \times 500$ Hz. We selected $x\_0$ empirically (although a theoretical approach is also possible [31]), and $x\_{\text{max}}$ was set to the Nyquist frequency. The role of $x\_{\text{max}}$ is to truncate $\pi\_3$ in order to avoid aliasing effects, since the support of $\pi\_3(\mathbf{x})$ is unbounded from above. We implemented this by using the following likelihood function in the nested sampling program:

$$\mathcal{L}'(\mathbf{x}) = \begin{cases} \mathcal{L}(\mathbf{x}) & \text{if } \mathbf{x}\_k \le \mathbf{x}\_{\text{max}} \text{ for all } (k = 1, 2, \dots, K) \\ 0 & \text{otherwise.} \end{cases} \tag{24}$$
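In log space, (24) amounts to a simple wrapper around the log-likelihood passed to the sampler; the cutoff value below is an assumed Nyquist frequency, not the one used in the paper:

```python
import numpy as np

X_MAX = 8000.0  # assumed Nyquist frequency in Hz (depends on the sample rate)

def log_like_truncated(x, log_like):
    """Eq. (24) in log space: return -inf whenever any x_k exceeds x_max,
    otherwise defer to the untruncated log-likelihood."""
    x = np.asarray(x)
    if np.any(x > X_MAX):
        return -np.inf
    return log_like(x)

flat = lambda x: 0.0  # stand-in log-likelihood for demonstration
assert log_like_truncated([500.0, 9000.0], flat) == -np.inf  # rejected
assert log_like_truncated([500.0, 1500.0], flat) == 0.0      # accepted
```

Truncating in the likelihood rather than the prior keeps the sampler's prior transform unchanged; the rejected region simply contributes zero to the evidence integral.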

First, we compare the influence of $\pi\_2$ and $\pi\_3$ on model selection. Given *D* ∈ *W*, the posterior probability of the number of resonances *K* is given by

$$p\_i(K) = \frac{Z\_i(K)}{\sum\_{K'} Z\_i(K')} \qquad (K = 1, 2, \dots, K\_{\text{max}}).\tag{25}$$
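Since nested sampling returns log-evidences, (25) is best evaluated in log space; the function name and the log *Z* values below are invented for illustration:

```python
import numpy as np

def model_posterior(logZ):
    """Eq. (25): normalize evidences Z(K), K = 1..K_max, into a posterior
    over K, working in log space to avoid numerical underflow."""
    logZ = np.asarray(logZ, dtype=float)
    w = np.exp(logZ - logZ.max())  # subtract the max before exponentiating
    return w / w.sum()

# hypothetical log-evidence values for K = 1..5
p = model_posterior([-120.4, -95.1, -60.3, -58.9, -59.2])
# the posterior concentrates on the K values with the largest log Z
```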

The results in the top row of Figure 4a are striking: while *p*2(*K*) shows individual preferences based on *D*, *p*3(*K*) prefers *K* = *K*max unequivocally.

**Figure 4.** (**a**) Model selection in Experiment I (top row) and Experiment II (bottom row). (**b**) In Experiment I, *π*<sup>2</sup> and *π*<sup>3</sup> are compared in terms of evidence [log *Zi*(*K*)] and uninformativeness [*Hi*(*K*)] for each (*D*, *K*). The arrows point from *π*<sup>2</sup> to *π*<sup>3</sup> and are color-coded by the value of *K*. For small values of *K*, the arrow lengths are too small to be visible on this scale.

Second, in Figure 4b, we compare *π*<sup>2</sup> and *π*<sup>3</sup> directly in terms of differences in evidence [log *Zi*(*K*)] and uninformativeness [*Hi*(*K*)] for each combination (*D*, *K*).

Arrows pointing *eastward* indicate $Z\_3(K) > Z\_2(K)$. The $\pi\_3$ prior dominates the $\pi\_2$ prior in terms of evidence for almost all values of *K*, indicating that $\pi\_3$ places its mass in regions of higher likelihood or, equivalently, that the data were much more probable under $\pi\_3$ than under $\pi\_2$. This implies that the hint of $\pi\_3$ at more structure beyond $K\_{\text{max}}$ should be taken seriously; we investigate this in Section 5.2.

Arrows pointing *northward* indicate *H*3(*K*) > *H*2(*K*), i.e., *π*<sup>3</sup> is *less* informative than *π*2, since more information is gained by updating from *π*<sup>3</sup> to *P*<sup>3</sup> than from *π*<sup>2</sup> to *P*2. It is observed that *π*<sup>2</sup> and *π*<sup>3</sup> are roughly comparable in terms of (un)informativeness.

#### *5.2. Experiment II: 'Free' Analysis*

We now freely look for more structure in the data by letting *K* vary up until $K\_{\text{max}} = 10$. This goes beyond the capacities of $\pi\_1$ (because of the label switching problem) and $\pi\_2$ (because no data are available to set the (*a*, *b*) bounds). Thus, the great advantage of $\pi\_3$ is that we can use a simple heuristic to set $\overline{\mathbf{x}\_0}$ and let the model perform the discovery, without worrying about convergence issues or the obtained evidence values. The bottom row in Figure 4a shows that model selection for the VTR problem is well defined, with the most probable values of $K \le 10$, except for *D* = until. That case is investigated in Figure 3, where the need for more VTRs (higher *K*) is apparent from the unmodeled broad peak centered at around 3000 Hz in the FFT power spectrum (right panel). Incidentally, this spectrum also shows that spectral peaks are often resolved into more than one VTR, which underlines the importance of using a prior that enables trouble-free handling of multiplets of arbitrary order. A final observation from the spectrum is that the inferred $\hat{x}\_k$ differ substantially from the supplied values in $\overline{\mathbf{x}}$ (Table 1), which hints at the weak inductive bias underlying $\pi\_3$.

#### **6. Discussion**

It is only when the information in the prior is comparable to the information in the data that the prior probability can make any real difference in parameter estimation problems or in model selection problems ([32], p. 9).

Although the weakly informative prior for resonance frequencies *π*<sup>3</sup> is meant to be overwhelmed, its practical advantage (i.e., solving the label switching problem) will nonetheless persist, making a real difference in model selection problems even when "the information in the prior" is much smaller than "the information in the data". In this sense, *π*<sup>3</sup> is quite unlike the prior referenced in the above quote. Since it will be overwhelmed, all it has to do is provide a reasonable density everywhere (which it does), be easily parametrizable (which it is), and be easy to sample from (which it is).

Thus, we hope that this prior can enable the use of robust evidence-based methods for a new class of problems, even in the presence of multiplets of arbitrary order. The prior is compatible with off-the-shelf exploration algorithms and solves the label switching problem without any special tuning or post processing. It would be interesting to compare it to other approaches, e.g., [33], especially in terms of exploration efficiency. It is valid for any collection of scale variables that is intrinsically ordered, of which frequencies and wavelengths seem to be the most natural examples. Some examples of recent work where the prior could be applied directly are:


**Author Contributions:** Conceptualization, writing, methodology, and analysis: M.V.S. Supervision: B.d.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Flemish AI plan and by the Research Foundation Flanders (FWO) under grant number G015617N.

**Acknowledgments:** We would like to thank Roxana Radulescu, Timo Verstraeten, and Yannick Jadoul for helpful comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **An Entropic Approach to Classical Density Functional Theory †**

**Ahmad Yousefi \* and Ariel Caticha**

Physics Department, University at Albany-SUNY, Albany, NY 12222, USA; acaticha@albany.edu


**Abstract:** The classical Density Functional Theory (DFT) is introduced as an application of entropic inference for inhomogeneous fluids in thermal equilibrium. It is shown that entropic inference reproduces the variational principle of DFT when information about the expected density of particles is imposed. This process introduces a family of trial density-parametrized probability distributions and, consequently, a trial entropy from which the preferred one is found using the method of Maximum Entropy (MaxEnt). As an application, the DFT model for slowly varying density is provided, and its approximation scheme is discussed.

**Keywords:** entropic inference; relative entropy; density functional theory; contact geometry; optimal approximations

#### **1. Introduction**

The Density Functional Theory was first developed in the context of quantum mechanics and only later extended to the classical regime. The theory was first introduced by Kohn and Hohenberg (1964) [1] as a computational tool to calculate the spatial density of an electron gas in the presence of an external potential at zero temperature. Soon afterwards, Mermin provided the extension to finite temperatures [2]. Ebner, Saam, and Stroud (1976) [3] applied the idea to simple classical fluids, and Evans (1979) provided a systematic formulation in his classic paper [4]: "The nature of the liquid–vapour interface and other topics in the statistical mechanics of non-uniform, classical fluids".

The majority of physicists and chemists today are aware of the quantum DFT and the Kohn–Sham model [5], while fewer are familiar with the classical DFT; a historical review of quantum DFT and its vast variety of applications is found in [6,7]. The classical DFT, similarly, is a "formalism designed to tackle the statistical mechanics of inhomogeneous fluids" [8], which has been used to investigate a wide variety of equilibrium phenomena, including surface tension, adsorption, wetting, fluids in porous materials, and the chemical physics of solvation.

Just as the Thomas–Fermi–Dirac theory is usually regarded as a precursor of quantum DFT, van der Waals' thermodynamic theory of capillarity under the hypothesis of a continuous variation of density [9] can be regarded as the earliest work on classical DFT without a fundamental proof of existence for such a variational principle.

"The long-term legacy of DFT depends largely on the continued value of the DFT computer programs that practitioners use daily" [6]. The algorithms behind the computer programs, all starting from an original Hartree–Fock method to solve the N-particle Schrödinger equation, have evolved with many approximations and extensions implemented over time by a series of individuals; although the algorithms produce accurate results, they do not mention the HK variational principle. Without the variational principle, the computer codes are suspected of being ad hoc or intuitively motivated without a solid theoretical foundation; therefore, the DFT variational principle not only scientifically justifies the DFT algorithms, but it also provides us with a basis to understand the repeatedly modified algorithms behind the codes.

**Citation:** Yousefi, A.; Caticha, A. An Entropic Approach to Classical Density Functional Theory. *Phys. Sci. Forum* **2021**, *3*, 13. https://doi.org/10.3390/psf2021003013

Academic Editors: Wolfgang von der Linden and Sascha Ranftl

Published: 24 December 2021



In this work, we derive the classical DFT as an application of the method of maximum entropy [10–14]. This integrates the classical DFT with the other formalisms of classical statistical mechanics (canonical, grand canonical, etc.) as an application of information theory. Our approach not only enables one to understand the theory from the Bayesian point of view, but also provides a framework for constructing equilibrium theories on the foundation of MaxEnt. We emphasize that our goal is not to derive an alternative to DFT. Our goal is purely conceptual. We wish to find a new justification or derivation of DFT that makes explicit how DFT fits within the MaxEnt approach to statistical mechanics. The advantage of such an understanding is the potential for future applications that are outside the reach of the current version of DFT.

In Section 2, we briefly review entropic inference as an inference tool that updates probabilities as degrees of rational belief in response to new information. Then, we show that any entropy maximization produces a contact structure that is invariant under the Legendre Transformations; this enables us to take advantage of these transformations for maximized entropy functions (functionals here) found from constraints other than those of thermal equilibrium, as well as thermodynamic potentials.

In Section 3, we briefly review the method of relative entropy for optimal approximation of probabilities, which allows us to derive and then generalize the Bogolyubov variational principle. Then, we apply it for the special case wherein the trial family of probabilities are parametrized by the density function *n*(*x*).

In Section 4, the Density Functional formalism is introduced as an extension of the existing ensemble formalisms of statistical mechanics (canonical, grand canonical, etc.), and we show that the core DFT theorem is an immediate consequence of MaxEnt; we prove that in the presence of an external potential *v*(*x*), there exists a trial density functional entropy *Sv*(*E*; *n*] maximized at the equilibrium density. We also prove that this entropy maximization is equivalent to minimization of a density functional potential Ω(*β*; *n*] given by

$$
\Omega\_{\mathbb{F}}(\beta; n] = \int d^3x v(\mathbf{x}) n(\mathbf{x}) + F(\beta; n] \tag{1}
$$

where *F*(*β*; *n*] is independent of *v*(*x*). This formulation achieves two objectives. (**i**) It shows that the density functional variational principle is an application of MaxEnt for non-uniform fluids at equilibrium, and therefore, varying the density *n*(*x*) in *Sv*(*E*; *n*] **does not** imply that the functional represents entropy of any non-equilibrium system. This trial entropy, although very useful, is just a mathematical construct that allows us to incorporate constraints that are related to one another by definition. (**ii**) By this approach, we show that the Bayesian interpretation of probability liberates the fundamental theorem of the DFT from an imaginary grand-canonical ensemble, i.e., the thermodynamic chemical potential is appropriately defined without the need to define microstates for varying numbers of particles.

Finally, in Section 5, as an illustration, we discuss the already well-known example of a slowly varying inhomogeneous fluid. We show that our entropic DFT allows us to reproduce the gradient approximation results derived by Evans [4]. There are two different approximations involved: (**i**) rewriting the non-uniform direct correlation function in terms of the uniform one and (**ii**) the use of linear response theory to evaluate the Fourier transform of direct correlation functions. The former assumes that the density is uniform inside each volume element, and the latter assumes that the difference in densities between neighboring volume elements is small compared to their average.

#### **2. Entropic Inference**

A discussion of the method of maximum entropy as a tool for inference is found in [14]. Given a **prior** Probability Distribution Function (**PDF**) *Q*(*X*), we want to find the **posterior** PDF *P*(*X*) subject to constraints on expected values of functions of *X*.

Formally, we need to maximize the relative entropy

$$\mathcal{S}\_r[P|Q] \equiv -\sum\_X (P \log P - P \log Q) \,\,,\tag{2}$$

under the constraints $A\_i = \sum\_X P(X) \hat{A}\_i(X)$ and $1 = \sum\_X P(X)$, where the $A\_i$ are real numbers, the $\hat{A}\_i$ are real-valued functions on the space of $X$, and $1 \le i \le m$ for $m$ constraints. The maximization process yields the posterior probability

$$P(X) = Q(X)\frac{1}{Z}e^{-\sum\_{i} \alpha\_{i}\hat{A}\_{i}(X)},\qquad\text{where }Z = \sum\_{X} Q(X)e^{-\sum\_{i} \alpha\_{i}\hat{A}\_{i}(X)},\tag{3}$$

where the $\alpha\_i$ are the Lagrange multipliers associated with the $A\_i$. Consequently, the maximized entropy is

$$S = \sum\_{i} \alpha\_{i} A\_{i} + \log Z \,. \tag{4}$$
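To make Equations (2)–(4) concrete, here is a minimal numerical sketch (our own toy setup, not from the paper: a six-state sample space with a uniform prior and a single mean constraint) that solves for the Lagrange multiplier and checks that Equation (4) reproduces the entropy of Equation (2):

```python
import numpy as np
from scipy.optimize import brentq

# Toy sample space X = {1,...,6} with a uniform prior Q and one constraint <X> = A
X = np.arange(1, 7)
Q = np.full(6, 1.0 / 6.0)
A = 4.5  # assumed target expectation

def posterior(alpha):
    # Equation (3): P(X) = Q(X) exp(-alpha X) / Z
    w = Q * np.exp(-alpha * X)
    return w / w.sum()

# Fix the multiplier alpha by enforcing the constraint sum_X P(X) X = A
alpha = brentq(lambda a: posterior(a) @ X - A, -10.0, 10.0)
P = posterior(alpha)

# Maximized entropy two ways: Equation (4) versus the definition in Equation (2)
Z = (Q * np.exp(-alpha * X)).sum()
S_eq4 = alpha * A + np.log(Z)
S_def = -(P * np.log(P) - P * np.log(Q)).sum()
```

The two routes agree because $\log(P/Q) = -\alpha X - \log Z$ for the posterior of Equation (3).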

To show that the above entropy readily produces a contact structure, we calculate the complete differential of Equation (4) and define the vanishing one-form $\omega\_{cl}$ as

$$
\omega\_{cl} \equiv dS - \sum\_{i} \alpha\_{i} dA\_{i} = 0 \; . \tag{5}
$$

Therefore, any classical entropy maximization with $m$ constraints produces a contact structure $\{\mathbb{T}, \omega\_{cl}\}$ in which the manifold $\mathbb{T}$ has $2m + 1$ coordinates $\{q\_0, q\_1, \dots, q\_m, p\_1, \dots, p\_m\}$.

The physically relevant manifold $\mathcal{M}$ is an $m$-dimensional sub-manifold of $\mathbb{T}$ on which $\omega\_{cl}$ vanishes; i.e., $\mathcal{M}$ is determined by $1 + m$ equations:

$$q\_0 \equiv S(\{A\_i\})\,, \qquad p\_i \equiv \alpha\_i = \frac{\partial S}{\partial A\_i}\,. \tag{6}$$

The Legendre Transformations, which are defined as

$$q\_0 \longrightarrow q\_0 - \sum\_{j=1}^{l} p\_j q\_j \,, \tag{7}$$

$$q\_i \longrightarrow p\_i \,, \quad p\_i \longrightarrow -q\_i \,, \quad \text{for } 1 \le i \le l \,,$$

are coordinate transformations on the space $\mathbb{T}$ under which $\omega\_{cl}$ is conserved. It has been shown [15,16] that the laws of thermodynamics produce a contact structure conforming to the above prescription. Here, we are emphasizing that the contact structure is an immediate consequence of MaxEnt, and therefore, it can be utilized in applications of information theory beyond thermodynamics.
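For instance, in the simplest case $m = l = 1$ (a verification step we add for clarity, using the coordinates of Equation (6), so that $(q\_0, q\_1, p\_1) = (S, A, \alpha)$), the transformation (7) sends $q\_0$ to $\tilde{S} \equiv S - \alpha A$ and $(q\_1, p\_1)$ to $(\alpha, -A)$, and the one-form is indeed conserved:

$$\omega\_{cl}' = d\tilde{S} - (-A)\, d\alpha = \left( dS - \alpha\, dA - A\, d\alpha \right) + A\, d\alpha = dS - \alpha\, dA = \omega\_{cl}\,.$$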

#### **3. MaxEnt and Optimal Approximation of Probabilities**

The posterior PDF found from entropic inference is usually too complicated to be used for practical purposes. A common solution is to approximate the posterior PDF with a more tractable family of PDFs $\{p\_\theta\}$ [17]. Given the exact probability $p\_0$, the preferred member of the tractable family, $p\_{\theta^\*}$, is found by maximizing the entropy of $p\_\theta$ relative to $p\_0$:

$$\frac{\delta S\_{r}\left[p\_{\theta}|p\_{0}\right]}{\delta \theta}\Big|\_{\theta=\theta^{\*}} = 0 \; . \tag{8}$$

The density functional formalism is a systematic method in which the family of trial probabilities is parametrized by the density of particles; in Section 4, we shall use the method of maximum entropy to determine the family of trial distributions parametrized by *n*(*x*), *p<sup>θ</sup>* ≡ *pn*. So, we can rewrite Equation (8) as

$$\frac{\delta}{\delta n(\mathbf{x'})} \left[ S\_r[p\_n|p\_0] + \alpha\_{\text{eq}}[N-\int d^3 \mathbf{x} n(\mathbf{x})] \right]\_{n(\mathbf{x}) = n\_{\text{eq}}(\mathbf{x})} = 0 \,. \tag{9}$$

We will see that the canonical distribution itself is a member of the trial family; therefore, in this case, the exact solution to Equation (9) is *p*<sup>0</sup> itself:

$$p\_n\big|\_{n(\mathbf{x})=n\_{eq}(\mathbf{x})} = p\_0\,. \tag{10}$$
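As a minimal sketch of this optimal-approximation step (the bimodal $p\_0$, the grid, and the Gaussian trial family are our own illustrative assumptions), one can maximize $S\_r[p\_\theta|p\_0]$, i.e., minimize the relative-entropy functional, over $\theta = (\mu, \log\sigma)$:

```python
import numpy as np
from scipy.optimize import minimize

# An "exact" but inconvenient distribution p0 on a grid (assumed bimodal toy model)
x = np.linspace(-6.0, 6.0, 1201)
dx = x[1] - x[0]
p0 = np.exp(-0.5 * (x - 2) ** 2) + 0.6 * np.exp(-0.5 * (x + 2) ** 2)
p0 /= p0.sum() * dx

def neg_Sr(theta):
    # -S_r[p_theta | p0] for a Gaussian trial family parametrized by (mu, log sigma)
    mu, log_sigma = theta
    s = np.exp(log_sigma)
    p = np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    p = np.clip(p, 1e-300, None)          # avoid log(0) in empty regions
    return np.sum(p * (np.log(p) - np.log(p0))) * dx

res = minimize(neg_Sr, x0=[0.0, 0.0])     # Equation (8): stationarity in theta
mu_star, sigma_star = res.x[0], np.exp(res.x[1])
```

Parametrizing the width by $\log\sigma$ keeps the trial family valid ($\sigma > 0$) during the unconstrained optimization.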

#### **4. Density Functional Formalism**

An equilibrium formalism of statistical mechanics is a relative entropy maximization process consisting of three crucial elements: (i) One must choose the microstates that describe the system of inference. (ii) The prior is chosen to be uniform. (iii) One must select the constraints that represent the information that is relevant to the problem at hand.

In the density ensemble, microstates of the system are given as positions and momenta of all *N* particles of the same kind, given the uniform prior probability distribution

$$Q(\{\vec{x}\_1, \dots, \vec{x}\_N; \vec{p}\_1, \dots, \vec{p}\_N\}) = constant. \tag{11}$$

Keeping in mind that we are looking for **thermal properties of inhomogeneous fluids**, it is natural to choose the density of particles *n*(*x*) as a computational constraint and the expected energy *E* as a thermodynamic constraint, in which *n*(*x*) represents the **inhomogeneity** and *E* defines the thermal equilibrium.

Note that all constraints (computational, thermal, etc.) in the framework can be incorporated as inferential constraints and can be imposed as prescribed in Section 2.

The density constraint holds for every point in space; therefore, we have $1 + 1 + \mathbb{R}^3$ constraints: one for normalization, one for the total energy, and one for the density of particles at each point in space. Thus, we have to maximize the relative entropy

$$S\_{r}[P|Q] = -\frac{1}{N!} \int d^{3N}x\, d^{3N}p\, (P \log P - P \log Q) \equiv -\mathrm{Tr}\_{c}(P \log P - P \log Q), \tag{12}$$

subject to constraints

$$1 = \langle 1 \rangle\,, \quad E = \langle \hat{H}\_{v} \rangle\,,\tag{13a}$$

$$n(\mathbf{x}) = \langle \hat{n}\_{\mathbf{x}} \rangle \quad \text{where } \int d^3x\, n(\mathbf{x}) = N, \tag{13b}$$

where $\langle\,\cdot\,\rangle \equiv \frac{1}{N!}\int (\,\cdot\,)\, P\, d^{3N}x\, d^{3N}p$. The classical Hamiltonian operator $\hat{H}\_v$ and the particle density operator $\hat{n}\_{\mathbf{x}}$ are given as

$$\hat{H}\_{\upsilon} \equiv \sum\_{i=1}^{N} v(\mathbf{x}\_{i}) + \hat{\mathcal{K}}(p\_{1}, \dots, p\_{N}) + \hat{\mathcal{U}}(\mathbf{x}\_{1}, \dots, \mathbf{x}\_{N}), \tag{14}$$

$$\hat{n}\_{\mathbf{x}} \equiv \sum\_{i=1}^{N} \delta(\mathbf{x} - \mathbf{x}\_{i}). \tag{15}$$

The density *n*(*x*) is not an arbitrary function; it is constrained by a fixed total number of particles,

$$\int d^3x n(\mathbf{x}) = N. \tag{16}$$

Maximizing (12) subject to (13) gives the posterior probability *P*(*x*1, ... , *xN*; *p*1, ... , *pN*) as

$$P = \frac{1}{Z\_{v}} e^{-\beta \hat{H}\_{v} - \int d^3x\, \alpha(\mathbf{x}) \hat{n}\_{\mathbf{x}}} \quad \text{under the condition} \quad \int d^3x\, n(\mathbf{x}) = N, \tag{17}$$

where *α*(*x*) and *β* are Lagrange multipliers.

The Lagrange multiplier function *α*(*x*) is implicitly determined by

$$\frac{\delta \log Z\_v}{\delta \alpha(\mathbf{x})} = -n(\mathbf{x}) \tag{18}$$

and by Equation (16)

$$-\int d^3x\, \frac{\delta \log Z\_{v}}{\delta \alpha(\mathbf{x})} = N \,. \tag{19}$$

Substituting the trial probabilities from (17) into (12) gives the **trial** entropy *Sv*(*E*;*n*] as

$$S\_{v}(E;n] = \beta E + \int d^3x\, \alpha(\mathbf{x}) n(\mathbf{x}) + \log Z\_{v}\,, \tag{20}$$

where *Zv*(*β*; *α*] is the **trial** partition function defined as

$$Z\_{v}(\beta; \alpha] = \mathrm{Tr}\_c\, e^{-\beta \hat{H}\_{v} - \int d^3x\, \alpha(\mathbf{x}) \hat{n}\_{\mathbf{x}}}. \tag{21}$$

The equilibrium density $n\_{eq}(\mathbf{x})$ is that which maximizes $S\_v(E;n]$ subject to $\int d^3x\, n(\mathbf{x}) = N$:

$$\frac{\delta}{\delta n(\mathbf{x}')} \left[ S\_v(E; n] + \alpha\_{eq} \left[N - \int d^3x\, n(\mathbf{x})\right] \right] = 0 \quad \text{for fixed } E. \tag{22}$$

Next, perform a Legendre transformation and define the Massieu functional $\tilde{S}\_{v}(\beta; n]$ as

$$\tilde{S}\_v \equiv S\_v - \beta E\,, \tag{23}$$

so that we can rewrite Equation (22) as

$$\frac{\delta}{\delta n(\mathbf{x}')} \left[ \tilde{S}\_v(\beta; n] - \alpha\_{eq} \int d^3x\, n(\mathbf{x}) \right] = 0 \quad \text{for fixed } \beta \,. \tag{24}$$

Combine (20), (23), and (24) and use the variational derivative identity $\frac{\delta n(\mathbf{x})}{\delta n(\mathbf{x}')} = \delta(\mathbf{x} - \mathbf{x}')$ to find

$$\int d^3x \left[ \frac{\delta \log Z\_v(\beta; \alpha]}{\delta \alpha(\mathbf{x})} + n(\mathbf{x}) \right] \frac{\delta \alpha(\mathbf{x})}{\delta n(\mathbf{x}')} = \alpha\_{eq} - \alpha(\mathbf{x}'). \tag{25}$$

The LHS of Equation (25) vanishes by (18), and therefore, the RHS must vanish for an arbitrary *n*(*x*), which implies that

$$\alpha(\mathbf{x}) = \alpha\_{eq} \quad \text{and} \quad \frac{\delta \log Z\_v}{\delta \alpha(\mathbf{x})} \Big|\_{\alpha\_{eq}} = -n\_{eq}(\mathbf{x}) \,. \tag{26}$$

Substituting (26) into (17) yields the equilibrium probability distribution

$$P^\*(\mathbf{x}\_1, \dots, \mathbf{x}\_N; p\_1, \dots, p\_N) = \frac{1}{Z\_v^\*} e^{-\beta \hat{H}\_v - \alpha\_{eq} \int d^3x\, \hat{n}\_{\mathbf{x}}} = \frac{1}{Z\_v^\*} e^{-\beta \hat{H}\_v - \alpha\_{eq} N}, \tag{27}$$

where $Z\_v^\*(\beta, \alpha\_{eq}) = \mathrm{Tr}\_c\, e^{-\beta \hat{H}\_v - \alpha\_{eq} N}$.

From the inferential point of view, **the variational principle for the grand potential and the equilibrium density** [4] is proved at this point; we showed that for an arbitrary classical Hamiltonian $\hat{H}\_v$, there exists a trial entropy $S\_v(E; n]$ defined by Equation (20), which assumes its maximum at fixed energy and varying *n*(*x*) under the condition $\int d^3x\, n(\mathbf{x}) = N$ at the equilibrium density, and gives the posterior PDF (Equation (27)) equal to that of the canonical distribution.

The Massieu functional $\tilde{S}\_v(\beta; n]$ from Equation (23) defines the **density functional potential** $\Omega\_v(\beta; n]$ by

$$\Omega\_{v}(\beta; n] \equiv \frac{-\tilde{S}\_{v}(\beta; n]}{\beta} = -\int d^3x\, \frac{\alpha(\mathbf{x})}{\beta} n(\mathbf{x}) - \frac{1}{\beta} \log Z\_{v}(\beta; \alpha] \,, \tag{28}$$

so that the maximization of *Sv*(*E*; *n*] (20) in the vicinity of the equilibrium is equivalent to the minimization of Ω*v*(*β*; *n*(*x*)] (28) around the same equilibrium point

$$\frac{\delta}{\delta n(\mathbf{x}')} \left[ \Omega\_v(\beta; n] + \frac{\alpha\_{eq}}{\beta} \int d^3x\, n(\mathbf{x}) \right] = 0 \; . \tag{29}$$

After we find $\Omega\_v$, we just need to recall that $\alpha(\mathbf{x}) = -\beta \frac{\delta \Omega\_v}{\delta n(\mathbf{x})}$ and substitute into Equation (26) to recover the "core integro-differential equation" [4] of the DFT as

$$\nabla \left( \frac{\delta \Omega\_v(\beta; n]}{\delta n(\mathbf{x})} \right)\_{eq} = 0\,, \tag{30}$$

which implies that

$$\Omega\_{v;eq} \leq \Omega\_v(\beta; n]\,, \tag{31}$$

where

$$\Omega\_{v;eq}(\beta; n] = -\frac{\alpha\_{eq}}{\beta} \int d^3x\, n(\mathbf{x}) - \frac{1}{\beta} \log Z\_v^\*(\beta, \alpha\_{eq})\,. \tag{32}$$

From Equation (28), it is clear that

$$
\Omega\_{v}(\beta; n] = \int d^3x\, v(\mathbf{x}) n(\mathbf{x}) + \langle \hat{\mathcal{K}} + \hat{\mathcal{U}} \rangle - \frac{S\_v(E; n]}{\beta} \,. \tag{33}
$$

It is convenient to define the **intrinsic density functional potential** *Fv* as

$$F\_{v}(\beta; n] \equiv \langle \hat{\mathcal{K}} + \hat{\mathcal{U}} \rangle - \frac{S\_{v}}{\beta} \tag{34}$$

to have

$$
\Omega\_v(\beta; n] = \int d^3x\, v(\mathbf{x}) n(\mathbf{x}) + F\_v(\beta; n] \,. \tag{35}
$$

Now we are ready to restate the fundamental theorem of the classical DFT:

**Theorem 1.** *The intrinsic functional potential Fv is a functional of density n*(*x*) *and is independent of the external potential:*

$$\frac{\delta F\_{\upsilon}(\beta; n]}{\delta v(\mathbf{x'})} = 0 \quad \text{for fixed } \beta \text{ and } n(\mathbf{x}). \tag{36}$$

**Proof.** The crucial observation behind the DFT formalism is that *P* and $Z\_v$ depend on the external potential *v*(*x*) and the Lagrange multipliers *α*(*x*) only through the particular combination $\bar{\alpha}(\mathbf{x}) \equiv \beta v(\mathbf{x}) + \alpha(\mathbf{x})$. Substitute Equation (20) into (34) to get

$$
\beta F\_{v}(\beta; n] = -\log Z(\beta; \bar{\alpha}] - \int d^3x\, \bar{\alpha}(\mathbf{x}) n(\mathbf{x}), \tag{37}
$$

where $Z(\beta; \bar{\alpha}] = Z\_v(\beta; \alpha] = \mathrm{Tr}\_c\, e^{-\beta(\hat{\mathcal{K}}+\hat{\mathcal{U}}) - \int d^3x\, \bar{\alpha}(\mathbf{x}) \hat{n}\_{\mathbf{x}}}$. The functional derivative of $\beta F\_v$ at fixed $n(\mathbf{x})$ and $\beta$ is

$$\frac{\delta \left(\beta F\_{v}(\beta; n]\right)}{\delta v(\mathbf{x}')}\Big|\_{\beta, n(\mathbf{x})} = -\int d^3x''\, \frac{\delta}{\delta \bar{\alpha}(\mathbf{x}'')} \left[\log Z(\beta; \bar{\alpha}] + \int d^3x\, \bar{\alpha}(\mathbf{x}) n(\mathbf{x})\right] \frac{\delta \bar{\alpha}(\mathbf{x}'')}{\delta v(\mathbf{x}')}\Big|\_{\beta, n(\mathbf{x})}. \tag{38}$$

Since $n(\mathbf{x}') = -\frac{\delta \log Z(\beta; \bar{\alpha}]}{\delta \bar{\alpha}(\mathbf{x}')}$, keeping $n(\mathbf{x})$ fixed is achieved by keeping $\bar{\alpha}(\mathbf{x})$ fixed:

$$\frac{\delta \bar{\alpha}(\mathbf{x}'')}{\delta v(\mathbf{x}')}\Big|\_{\beta, n(\mathbf{x})} = \frac{\delta \bar{\alpha}(\mathbf{x}'')}{\delta v(\mathbf{x}')}\Big|\_{\beta, \bar{\alpha}(\mathbf{x})} = 0 \,, \tag{39}$$

so that

$$\frac{\delta F\_v(\beta; n]}{\delta v(\mathbf{x}')} \Big|\_{\beta, n(\mathbf{x})} = 0 \,, \tag{40}$$

which concludes the proof; thus, we can write down the intrinsic potential as

$$F(\beta; n] = F\_{v}(\beta; n] \,. \tag{41}$$

**Remark 1.** *Note that since a change in the external potential v*(*x*) *can be compensated by a suitable change in the multiplier α*(*x*) *in such a way as to keep α*¯(*x*) *fixed, such changes in v*(*x*) *will have no effect on n*(*x*)*. Therefore, keeping n*(*x*) *fixed on the left-hand side of (37) means that α*¯(*x*) *on the right side is fixed too.*

Now, we can substitute Equation (35) into (29) and define the chemical potential

$$
\mu \equiv \frac{-\alpha\_{eq}}{\beta} \; , \tag{42}
$$

to have

$$\frac{\delta}{\delta n(\mathbf{x}')} \left[ \int d^3x\, v(\mathbf{x}) n(\mathbf{x}) + F(\beta; n] - \mu \int d^3x\, n(\mathbf{x}) \right]\_{n(\mathbf{x}) = n\_{eq}(\mathbf{x})} = 0 \,. \tag{43}$$

We can also substitute (35) into (30) to find

$$v(\mathbf{x}) + \frac{\delta F}{\delta n(\mathbf{x})}\Big|\_{eq} = \mu\,, \tag{44}$$

which allows us to define and **interpret** $\mu\_{in}(\mathbf{x}; n] \equiv \frac{\delta F}{\delta n(\mathbf{x})}$ as the **intrinsic chemical potential** of the system. To proceed further, we also split *F* into an ideal gas part plus an interaction part as

$$F(\beta; n] = F\_{id}(\beta; n] - \phi(\beta; n]. \tag{45}$$

Differentiating with $\frac{\delta}{\delta n(\mathbf{x})}$ gives

$$
\beta \mu\_{in}(\mathbf{x}; n] = \log(\lambda^3 n(\mathbf{x})) - c(\mathbf{x}; n] \,, \tag{46}
$$

where, for a monatomic gas, $\lambda = \left(\frac{2\pi\beta\hbar^2}{m}\right)^{1/2}$ is the thermal de Broglie wavelength. The additional one-body potential $c(\mathbf{x}; n] = \beta \frac{\delta \phi}{\delta n(\mathbf{x})}$ is related to the Ornstein–Zernike direct correlation function of the non-uniform fluid [18–20] by

$$c^{(2)}(\mathbf{x}, \mathbf{x}'; n] \equiv \frac{\delta c(\mathbf{x}; n]}{\delta n(\mathbf{x}')} = \beta \frac{\delta^2 \phi(\beta; n]}{\delta n(\mathbf{x}) \delta n(\mathbf{x}')} \,. \tag{47}$$

#### **5. Slowly Varying Density and Gradient Expansion**

We have proved that the solution to Equation (43) is the equilibrium density. However, the functional *F*(*β*; *n*] needs to be approximated, because the direct calculation of *F* involves calculating the canonical partition function, the very task we have been avoiding to begin with. Therefore, different models of DFT may vary in their approach to guessing *F*(*β*; *n*]. Now assume that we are interested in a monatomic fluid with a slowly varying external potential. In our language, this means that we use the approximation $\int d^3x \equiv \sum (\Delta x)^3$, where $\Delta x$ is much larger than the density correlation length, and the change in density across each volume element is small compared to the average density. This allows us to interpret each volume element $(\Delta x)^3$ as a fluid at grand canonical equilibrium with the rest of the fluid as its thermal and particle bath.

Similarly to [4], we expand *F*(*β*; *n*] as

$$F(\beta; n) = \int d^3 \mathbf{x} \left[ f\_0(n(\mathbf{x})) + f\_2(n(\mathbf{x})) |\nabla n(\mathbf{x})|^2 + \mathcal{O}(\nabla^4 n(\mathbf{x})) \right]. \tag{48}$$

Differentiating with respect to *n*(*x*), we have

$$\mu\_{in}(\mathbf{x};n) = \frac{\delta F}{\delta n(\mathbf{x})} = f\_0'(n(\mathbf{x})) - f\_2'(n(\mathbf{x}))|\nabla n(\mathbf{x})|^2 - 2f\_2(n(\mathbf{x}))\nabla^2 n(\mathbf{x}).\tag{49}$$
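Equation (49) can be checked numerically. The sketch below (entirely our own construction: a periodic 1-D grid with model choices $f\_0(n) = n^2$ and constant $f\_2$, none of which come from the paper) compares the analytic functional derivative with a brute-force perturbation of the discretized functional of Equation (48):

```python
import numpy as np

# Discretized 1-D gradient-expansion functional F[n] = sum_i [f0(n) + f2 |dn/dx|^2] h
L, M, c2 = 2.0 * np.pi, 256, 0.3          # assumed box size, grid size, constant f2
h = L / M
xs = np.arange(M) * h
n = 1.0 + 0.2 * np.sin(xs)                # assumed periodic density profile

def F(n):
    dn = (np.roll(n, -1) - np.roll(n, 1)) / (2.0 * h)   # centred gradient
    return np.sum(n ** 2 + c2 * dn ** 2) * h

# Equation (49) with f2' = 0: delta F / delta n = f0'(n) - 2 f2 n'' = 2 n - 2 c2 n''
lap = (np.roll(n, -1) - 2.0 * n + np.roll(n, 1)) / h ** 2
analytic = 2.0 * n - 2.0 * c2 * lap

# Brute force: delta F / delta n(x_i) ~ [F(n + eps e_i) - F(n - eps e_i)] / (2 eps h)
eps = 1e-6
numeric = np.zeros(M)
for i in range(M):
    e = np.zeros(M)
    e[i] = eps
    numeric[i] = (F(n + e) - F(n - e)) / (2.0 * eps * h)
```

The two estimates differ only by the $O(h^2)$ discrepancy between the two Laplacian stencils involved.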

In the absence of an external potential, *v*(*x*) = 0, the equilibrium density is uniform, so the second and third terms on the RHS of (49) vanish, and from Equation (44), *μin* = *μ*; therefore, we have

$$f\_0'(n) = \mu(n(x)),\tag{50}$$

where *μ*(*n*(*x*)) is the chemical potential of a uniform fluid with density *n* = *n*(*x*). On the other hand, with the assumption that each volume element behaves as if it were a grand canonical ensemble of its own, under the influence of both the external potential and the additional one-body interaction *c*(*x*; *n*], we know that the second derivative of *F* is related to Ornstein–Zernike theory by

$$
\beta \frac{\delta^2 F}{\delta n(\mathbf{x}) \delta n(\mathbf{x}')} = \frac{\delta(\mathbf{x} - \mathbf{x}')}{n(\mathbf{x})} - c^{(2)}(\mathbf{x}, \mathbf{x}'; n] \,. \tag{51}
$$

Therefore we have a Taylor expansion of *F* around the uniform density as

$$\begin{split} F[n(\mathbf{x})] &= F[n] + \int d^3x \left[\frac{\delta F}{\delta n(\mathbf{x})}\right]\_{n} \tilde{n}(\mathbf{x}) \\ &+ \frac{1}{2\beta} \iint d^3x\, d^3x' \Big[ \frac{\delta(\mathbf{x} - \mathbf{x}')}{n(\mathbf{x})} - c^{(2)}(|\mathbf{x} - \mathbf{x}'|; n] \Big]\_{n} \tilde{n}(\mathbf{x}) \tilde{n}(\mathbf{x}') + \dots \end{split} \tag{52}$$

where $\tilde{n}(\mathbf{x}) \equiv n(\mathbf{x}) - n$, and $c^{(2)}(|\mathbf{x} - \mathbf{x}'|; n]$ is the direct correlation function of a uniform fluid with density *n* = *n*(*x*). The Fourier transform of the second integral in (52) gives

$$\frac{1}{2\beta} \iint d^3x\, d^3x' \left[ \frac{\delta(\mathbf{x} - \mathbf{x}')}{n(\mathbf{x})} - c^{(2)}\left( |\mathbf{x} - \mathbf{x}'|; n \right] \right]\_{n} \tilde{n}(\mathbf{x}) \tilde{n}(\mathbf{x}') \tag{53}$$

$$= \frac{-1}{2\beta V} \sum\_q \left( c^{(2)}(q; n] - \frac{1}{n} \right) \tilde{n}(q) \tilde{n}(-q) \,,$$

and comparing with (48) yields

$$f\_0''(n) = \frac{-1}{\beta} \left( a(n) - \frac{1}{n} \right) \,, \qquad f\_2(n) = \frac{-b(n)}{2\beta} \,, \tag{54}$$

where the functions *a*(*n*) and *b*(*n*) are defined as the coefficients in the small-*q* expansion of the Fourier transform of the Ornstein–Zernike direct correlation function, $c^{(2)}(q; n] = a(n) + b(n) q^2 + \dots$. Evaluating *b*(*n*) with linear response theory, one finds that

$$f\_2(n(\mathbf{x})) = \frac{1}{12\beta} \int d^3x'\, |\mathbf{x} - \mathbf{x}'|^2 c^{(2)}(|\mathbf{x} - \mathbf{x}'|; n] \,. \tag{55}$$
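The identification $b(n) = -\frac{1}{6}\int d^3r\, r^2 c^{(2)}(r; n]$ behind Equations (54) and (55) can be verified numerically for an assumed Gaussian model correlation function (our own toy choice, made only because its Fourier transform, $\pi^{3/2} e^{-q^2/4}$, is known in closed form):

```python
import numpy as np

# Toy direct correlation function of a uniform fluid: c2(r) = exp(-r^2) (assumed)
dr = 5e-4
r = np.arange(1, 20001) * dr              # radial grid, avoids r = 0
c2r = np.exp(-r ** 2)

def c2q(q):
    # 3-D Fourier transform of a radial function: (4 pi / q) * int dr r sin(qr) c2(r)
    return 4.0 * np.pi / q * np.sum(r * np.sin(q * r) * c2r) * dr

# Small-q expansion c2(q) = a + b q^2 + ...
a_num = c2q(1e-6)                          # q -> 0 limit
b_num = (c2q(0.05) - a_num) / 0.05 ** 2    # finite-difference estimate of b
# Gradient-expansion prediction: b = -(1/6) * int d^3r r^2 c2(r)
b_pred = -np.sum(4.0 * np.pi * r ** 4 * c2r) * dr / 6.0
```

For this model, both estimates should approach the analytic values $a = \pi^{3/2}$ and $b = -\pi^{3/2}/4$.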

We can substitute Equations (55) and (50) into (49) and use the equilibrium identity ∇*μ* = 0 to find the integro-differential equation

$$\nabla \left[ v(\mathbf{x}) + \mu(n(\mathbf{x})) - f\_2'(n(\mathbf{x})) |\nabla n(\mathbf{x})|^2 - 2f\_2(n(\mathbf{x})) \nabla^2 n(\mathbf{x}) \right]\_{n(\mathbf{x}) = n\_{\text{eq}}(\mathbf{x})} = 0,\tag{56}$$

which determines the equilibrium density $n\_{eq}(\mathbf{x})$ in the presence of the external potential *v*(*x*), given the Ornstein–Zernike direct correlation function of the uniform fluid $c^{(2)}(|\mathbf{x} - \mathbf{x}'|; n]$.
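At lowest order, i.e., dropping the $f\_2$ terms in Equation (56), the condition reduces to $v(\mathbf{x}) + \mu(n(\mathbf{x})) = \text{const}$; for an ideal-gas $\mu(n) = \frac{1}{\beta}\log(\lambda^3 n)$ this gives the barometric profile $n\_{eq}(\mathbf{x}) \propto e^{-\beta v(\mathbf{x})}$. A minimal 1-D sketch (all numbers are illustrative assumptions):

```python
import numpy as np

beta, N = 1.0, 50.0                    # assumed inverse temperature and particle number
x = np.linspace(0.0, 10.0, 1001)
dx = x[1] - x[0]
v = 0.5 * x                            # assumed linear external potential

# Zeroth order of Equation (56): n_eq(x) proportional to exp(-beta v(x)),
# normalized to satisfy the particle-number constraint (16)
n_eq = np.exp(-beta * v)
n_eq *= N / (np.sum(n_eq) * dx)

# Check: v(x) + mu(n(x)) is constant across the box (lambda^3 absorbed into the constant)
mu_local = np.log(n_eq) / beta
residual = np.std(v + mu_local)
```

The gradient corrections of Equation (56) would perturb this profile only where $\nabla n$ is large.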

#### **6. Conclusions**

We have shown that the variational principle of classical DFT is a special case of applying the method of maximum entropy to construct optimal approximations in terms of the variables that capture the relevant physical information, namely, the particle density *n*(*x*). It is worth emphasizing once again: In this paper, we have pursued the purely conceptual goal of finding how the DFT fits within the MaxEnt approach to statistical mechanics. The advantage of achieving such an insight is the potential for future applications that lie outside the reach of the current versions of DFT. As an illustration, we have discussed the already well-known example of a slowly varying inhomogeneous fluid. Future research can be pursued in three different directions: (**i**) To show that the method of maximum entropy can also be used to derive the quantum version of DFT, (**ii**) to approach the Dynamic DFT [21], generalizing the idea to non-equilibrium systems by following the theory of maximum caliber [22], and (**iii**) to revisit the objective of Section 5 and construct weighted DFTs [23,24] by using the method of maximum entropy.

#### **References**

