*Article* **Asymptotic Efficiency of Point Estimators in Bayesian Predictive Inference**

**Emanuele Dolera 1,2**


**Abstract:** The point estimation problems that emerge in Bayesian predictive inference are concerned with random quantities which depend on both observable and non-observable variables. Intuition suggests splitting such problems into two phases, the former relying on estimation of the random parameter of the model, the latter concerning estimation of the original quantity from the distinguished element of the statistical model obtained by plug-in of the estimated parameter in the place of the random parameter. This paper discusses both phases within a decision theoretic framework. As a main result, a non-standard loss function on the space of parameters, given in terms of a Wasserstein distance, is proposed to carry out the first phase. Finally, the asymptotic efficiency of the entire procedure is discussed.

**Keywords:** asymptotic efficiency; bayesian predictive inference; compatibility equations; decision theory; de Finetti's representation theorem; exchangeability; Wasserstein distance

**MSC:** 62A01; 62C10; 62C12; 60F17

## **1. Introduction**

This paper carries on a project—conceived by Eugenio Regazzini some years ago, and partially developed in collaboration with Donato M. Cifarelli—which aims at proving why and how some classical, frequentist algorithms from the theory of point estimation can be justified, under some regularity assumptions, within the Bayesian framework. See [1–4]. This project was inspired, in turn, by the works and the thoughts of Bruno de Finetti about the foundation of statistical inference, substantially based on the following principles.


Indeed, these principles do not force the assessment of a specific prior distribution, but just lead the statistician to take cognizance that some prior has, in any case, to exist.

**Citation:** Dolera, E. Asymptotic Efficiency of Point Estimators in Bayesian Predictive Inference. *Mathematics* **2022**, *10*, 1136. https:// doi.org/10.3390/math10071136

Academic Editor: Jiancang Zhuang

Received: 1 March 2022 Accepted: 29 March 2022 Published: 1 April 2022


**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

This fact agrees with de Finetti's indication to keep the concepts of "Bayesian standpoint" and "Bayesian techniques" distinct. See also [12].

Despite their robust logical coherence, orthodox Bayesian solutions to inferential problems suffer two main drawbacks on the practical, operational side, which may limit their use. On the one hand, it is rarely the case that a prior distribution is fully specified due to a lack of prior information, this phenomenon even being amplified by the choice of complex statistical models (e.g., of nonparametric type). On the other hand, the numerical tractability of the Bayesian solutions often proves to be a serious hurdle, especially in the presence of large datasets. For example, it suffices to mention those algorithms from Bayesian nonparametrics that involve tools from combinatorics (like permutations or set/integer partitions) having exponential algorithmic complexity. See, e.g., [13]. Finally, the implicit nature of the notion of *Bayesian estimator*, although conceptually useful, makes it hard to employ in practical problems, especially in combination with non-quadratic loss functions, even if noteworthy progress has been achieved from the numerical side in the last decade. All these issues still pervade modern statistical literature while, historically, they have paved the way firstly to the "Fisherian revolution" and then to more recent techniques such as empirical Bayes and objective Bayes methods. The ultimate result has been a proliferation of many *ad hoc* algorithms, often of limited conceptual value, that provide focused and operational solutions to very specific problems.

Aware of this trend, Eugenio Regazzini conceived his project with the aim of reframing the algorithms of modern statistics—especially those obtained by frequentist techniques—within the Bayesian theory summarized in points 1–3 above, and of showing whether they can be re-interpreted as good approximations of Bayesian algorithms. The rationale is that orthodox Bayesian theory could be open to accepting even non-Bayesian solutions (hence, suboptimal ones if seen "through the glass of the prior") as long as such solutions prove to be more operational than the Bayesian ones and, above all, *asymptotically almost efficient*, in the Bayesian sense. This concept means that, for a fixed prior, the Bayesian risk function evaluated at the non-Bayesian estimator is approximately equal to the overall minimum of such risk function (achieved when evaluated at the Bayesian estimator), the error of approximation going to zero as the sample size increases. Of course, these goals can be carried out only after providing quantitative estimates for the risk function, as done, for example, in some decision-theoretic work on the empirical Bayes approach to inference. See, e.g., the seminal work [14]. Indeed, Regazzini's project has much in common with the empirical Bayes theory, although the former remains strictly on the "orthodox Bayesian main way" whilst the latter mixes Bayesian and frequentist techniques. As to more practical results, an archetype of Regazzini's line of reasoning can be found in [15] (Section 5), which proves that the maximum likelihood estimator (MLE)—obtained in the classical context of *n* i.i.d. observations, driven by a regular parametric model—has the same Bayesian efficiency (coinciding with the *mean square error*, in this case) as the Bayesian estimator up to *O*(1/*n*)-terms, provided that the prior is smooth enough.
Another example can be found in [16] where the authors, while dealing with species sampling problems, rediscover the so-called Good–Turing estimator for the probability of finding a new species (which is obtained via empirical Bayes arguments) within the Bayesian nonparametric setting described in [17]. Other examples are contained in [2,4]. In any case, Regazzini's project is not only a matter of "rigorously justifying" a given algorithm, but rather of logically conceiving an estimation problem from the beginning to the end by quantifying coherent degrees of approximation in terms of the Bayesian risk or, more generally, in terms of *speed of shrinkage of the posterior distribution* with respect to distances on the space of probability measures, these goals being proved *uniformly with respect to an entire class of priors*. Hence, this plan of action is conceptually antipodal to that of (nowadays called) "Bayesian consistency", i.e., to justify a Bayesian algorithm from the point of view of classical statistics.

## *1.1. Main Contributions and General Strategy*

In this paper, we pursue Regazzini's project by considering some *predictive problems* where the quantity $U_{n,m}$ to be estimated depends explicitly on new (hitherto unobserved) variables $X_{n+1}, \dots, X_{n+m}$, possibly besides the original sample variables $X_1, \dots, X_n$ and an unobservable parameter $T$. Thus, $U_{n,m} = u_{n,m}(X_{n+1}, \dots, X_{n+m}; X_1, \dots, X_n; T)$. For simplicity, we confine ourselves to the simplest case in which both $(X_1, \dots, X_n)$ and $(X_{n+1}, \dots, X_{n+m})$ are segments of a whole sequence $\{X_i\}_{i \ge 1}$ of exchangeable $\mathbb{X}$-valued random variables, while $T$ is a random parameter that makes the $X_i$'s conditionally i.i.d. with a common distribution depending on $T$, in accordance with de Finetti's representation theorem. From the statistical point of view, the exchangeability assumption just reveals a supposed homogeneity between the observable quantities while, from a mathematical point of view, it simply states that the joint distribution of any $k$-subset of the $X_i$'s depends only on $k$ and not on the specific $k$-subset, for any $k \in \mathbb{N}$. Thus, we are setting our estimation problem within an orthodox Bayesian framework where, independently of whether we are able to precisely assess the prior distribution, such a prior has to exist for mere mathematical reasons. This solid theoretical background provides all the elements to logically formulate the original predictive estimation problem as the following decision-theoretic question: find

$$
\hat{U}_{n,m} = \operatorname{Argmin}_{Z}\, \mathbb{E}\big[\mathcal{L}_{\mathbb{U}}(U_{n,m}, Z)\big] \; , \tag{1}
$$

where: $\mathcal{L}_{\mathbb{U}}$ is a suitable loss function on the space $\mathbb{U}$ in which $U_{n,m}$ takes its values; $Z$ runs over the space of all $\mathbb{U}$-valued, $\sigma(X_1, \dots, X_n)$-measurable random variables; the expectation is taken with respect to the joint distribution of $(X_1, \dots, X_{n+m})$ and $T$. It is remarkable that the same estimation problem would have been meaningless in classical (Fisherian) statistics, which can solely consider the estimation of (a function of) the parameter, and not of random quantities. Now, the solution displayed in (1) depends of course on the prior, and it is the optimal one when seen, in terms of the Bayesian risk, "through the glass of that prior". However, the above-mentioned difficulties about the assessment of a specific prior can diminish the practical (but not the conceptual) value of this solution, in the sense that it could prove non-operational in the case of a lack of prior information. Sometimes, when the prior is known only up to further unknown parameters, another estimation problem is needed.
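When $\mathbb{U} \subseteq \mathbb{R}$ and $\mathcal{L}_{\mathbb{U}}$ is the quadratic loss, the solution of (1) is the conditional expectation $\mathbb{E}[U_{n,m} \mid X_1, \dots, X_n]$. The following minimal sketch checks this well-known fact numerically on a discrete toy law (the values and probabilities below are made up for illustration, not taken from the paper):

```python
# Toy check: under quadratic loss, the minimiser of z -> E[(U - z)^2]
# is the mean of U.  Here U takes four values with the (illustrative)
# conditional probabilities below.
vals  = [0.0, 1.0, 2.0, 3.0]
probs = [0.1, 0.4, 0.3, 0.2]

def risk(z):
    # expected quadratic loss of the constant estimate z
    return sum(p * (u - z) ** 2 for u, p in zip(vals, probs))

cond_mean = sum(p * u for u, p in zip(vals, probs))

# brute-force grid search over [0, 3] with step 0.001
grid = [i / 1000 for i in range(3001)]
z_star = min(grid, key=risk)
```

The grid minimiser agrees with the mean up to the grid resolution, in accordance with the fact that the Bayesian estimator under quadratic loss is the posterior mean.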

Our research is then focused on formalizing a general strategy aimed at producing, under regularity conditions, alternative estimators $U^*_{n,m}$ which prove to be asymptotically nearly optimal (as specified above), uniformly with respect to any prior in some class. More precisely, for any fixed prior in that class, we aim at proving the validity of the asymptotic expansions (as $n \to +\infty$),

$$\mathbb{E}\left[\mathcal{L}_{\mathbb{U}}(U_{n,m}, \hat{U}_{n,m})\right] = \hat{R}_{0,m} + \frac{1}{n}\hat{R}_{1,m} + o\left(\frac{1}{n}\right) \tag{2}$$

$$\mathbb{E}\left[\mathcal{L}_{\mathbb{U}}(U_{n,m}, U^*_{n,m})\right] = R^*_{0,m} + \frac{1}{n} R^*_{1,m} + o\left(\frac{1}{n}\right),\tag{3}$$

along with $\hat{R}_{i,m} = R^*_{i,m}$ for $i = 0, 1$, where $\hat{U}_{n,m}$ is the same as in (1). This is exactly the content of Theorem 5.1 and Corollary 5.1 in [15], which deal with the case where: $U_{n,m} = T$ (estimation of the parameter of the model), so that $\mathbb{U}$ coincides with the parameter space $\Theta \subseteq \mathbb{R}$; $\mathcal{L}_{\mathbb{U}}$ is the quadratic loss function, so that the risk function coincides with the mean square error; $\hat{U}_{n,m} = \mathbb{E}[T \mid X_1, \dots, X_n]$ is the Bayesian estimator with respect to $\mathcal{L}_{\mathbb{U}}$; $U^*_{n,m}$ coincides with the MLE; $\hat{R}_{0,m} = R^*_{0,m} = 0$ and $\hat{R}_{1,m} = R^*_{1,m} = \int_{\Theta} [\mathrm{I}(\theta)]^{-1}\, \pi(\mathrm{d}\theta)$, $\mathrm{I}$ denoting the Fisher information of the model and $\pi$ being any prior on $\Theta$ with positive and sufficiently smooth density (with respect to the Lebesgue measure). Moving to truly predictive problems, the main operational solutions come from the empirical Bayes theory, which shares Equation (1) with the approach we are going to present. However, the empirical Bayes theory very soon leaves the "Bayesian main way" by bringing some sort of

Law of Large Numbers into the game, in order to replace the unknown quantities (usually, the prior itself). Here, on the contrary, we pursue Regazzini's project by proposing a new method that remains on the Bayesian main way. It consists of the following six steps.
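The parametric archetype recalled above—Bayes estimator and MLE sharing the same risk expansion up to $O(1/n)$ terms—can be computed exactly in a conjugate toy model. The Beta–Bernoulli choice below is an illustrative assumption, not taken from the paper; both Bayesian risks admit closed or finite-sum expressions:

```python
from math import comb, lgamma, exp

def log_beta(x, y):
    # log of the Euler Beta function B(x, y)
    return lgamma(x) + lgamma(y) - lgamma(x + y)

# Beta(a,b) prior, Bernoulli(theta) model.  Here 1/I(theta) = theta(1-theta),
# so the common leading coefficient of the risk expansions is
# E[T(1-T)] = a*b / ((a+b)(a+b+1)).
def bayes_risk_posterior_mean(n, a=2.0, b=3.0):
    # E[Var(T | X_1..X_n)]: exact sum over the Beta-Binomial law of S.
    total = 0.0
    for s in range(n + 1):
        p_s = comb(n, s) * exp(log_beta(a + s, b + n - s) - log_beta(a, b))
        post_var = (a + s) * (b + n - s) / ((a + b + n) ** 2 * (a + b + n + 1))
        total += p_s * post_var
    return total

def bayes_risk_mle(n, a=2.0, b=3.0):
    # E[(T - S/n)^2] = E[T(1-T)]/n for the MLE S/n.
    return (a * b / ((a + b) * (a + b + 1))) / n
```

At moderate sample sizes the two risks differ only at the $O(1/n^2)$ level, while the posterior-mean risk stays strictly below the MLE risk, as the general theory predicts.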

**Step 1**. Reformulate problem (1) into another (orthodox Bayesian) estimation problem about *T*, the random parameter of the model. Roughly speaking, start from the following de Finetti representation:

$$\mathbb{P}[X_1 \in A_1, \dots, X_k \in A_k \mid T = \theta] = \mu^{\otimes k} (A_1 \times \dots \times A_k \mid \theta) := \prod_{i=1}^k \mu(A_i \mid \theta) \,, \tag{4}$$

valid for all $k \in \mathbb{N}$, Borel sets $A_1, \dots, A_k$, $\theta \in \Theta$, and some probability kernel $\mu(\cdot \mid \cdot)$, which coincides with the statistical model for the single observation. Then, consider the following estimation problem: find

$$\hat{T}_{n,m} = \operatorname{Argmin}_{W}\, \mathbb{E} \left[ \mathcal{L}_{\Theta, (X_1, \dots, X_n)} (T, W) \right] \,, \tag{5}$$

where: $\mathcal{L}_{\Theta,(X_1,\dots,X_n)}$ is a suitable loss function on $\Theta$; $W$ runs over the space of all $\Theta$-valued, $\sigma(X_1, \dots, X_n)$-measurable random variables; the expectation is taken with respect to the joint distribution of $(X_1, \dots, X_n)$ and $T$. The explicit definition of $\mathcal{L}_{\Theta,(X_1,\dots,X_n)}$ is given in terms of a Wasserstein distance, as follows:

$$\mathcal{L}_{\Theta,(x_1,\dots,x_n)}(\theta,\tau) = \inf_{\Gamma} \int_{\mathbb{U}^2} \mathcal{L}_{\mathbb{U}}(u,v)\,\Gamma(\mathrm{d}u\, \mathrm{d}v),\tag{6}$$

where $\Gamma$ runs over the Fréchet class of all probability measures on $\mathbb{U}^2$ with marginals $\gamma_{\theta,(x_1,\dots,x_n)}$ and $\gamma_{\tau,(x_1,\dots,x_n)}$, respectively, and $\gamma_{\theta,(x_1,\dots,x_n)}$ stands for the pushforward (image) measure $\mu^{\otimes m}(\cdot \mid \theta) \circ u_{n,m}(\cdot\,; x_1, \dots, x_n; \theta)^{-1}$ on $\mathbb{U}$.
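In the special case $\mathbb{U} = \mathbb{R}$ with quadratic $\mathcal{L}_{\mathbb{U}}$, the infimum in (6) is the squared 2-Wasserstein distance between the two pushforward measures, and for equally weighted empirical measures on the line the optimal coupling simply pairs sorted values (quantile coupling). A minimal sketch under these assumptions:

```python
# Squared 2-Wasserstein distance between two equally weighted empirical
# measures on R: sort both samples and sum squared gaps (quantile coupling
# is optimal for convex costs on the real line).
def w2_squared_empirical(xs, ys):
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Example: translating a point cloud by a constant c gives W_2^2 = c^2.
xs = [0.0, 1.0, 2.0, 5.0]
ys = [x + 3.0 for x in xs]
```

This one-dimensional shortcut is what makes the loss (6) explicitly computable in the simplified framework of Section 2.2.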

**Step 2**. After getting the estimator $\hat{T}_{n,m}$ from (5), consider estimators $U^*_{n,m}$ that satisfy the following approximated version of problem (1): find

$$U^*_{n,m} = \operatorname{Argmin}_Z \int_{\mathbb{X}^m} \mathcal{L}_{\mathbb{U}} \Big( u_{n,m}(y_1, \dots, y_m; X_1, \dots, X_n; \hat{T}_{n,m}), Z \Big)\, \mu^{\otimes m} (\mathrm{d}y_1 \dots \mathrm{d}y_m \mid \hat{T}_{n,m}), \tag{7}$$

where $Z$ runs over the space of all $\mathbb{U}$-valued, $\sigma(X_1,\dots,X_n)$-measurable random variables.

**Step 3**. For the estimators $\hat{U}_{n,m}$ and $U^*_{n,m}$ that solve (1) and (7) respectively, prove that (2) and (3) hold along with $\hat{R}_{i,m} = R^*_{i,m}$ for $i = 0, 1$. This entails the asymptotic almost efficiency of $U^*_{n,m}$, which is still a prior-dependent estimator. In any case, this step is crucial to show that the loss function $\mathcal{L}_{\Theta,(x_1,\dots,x_n)}$ given in (6) is "Bayesianly well-conceived", that is, in harmony with the original aim displayed in (1).

**Step 4**. Identities (2) and (3) provide conditions on the statistical model $\mu(\cdot \mid \cdot)$ that possibly allow the existence of some *prior-free* estimator $\tilde{T}_{n,m}$ of $T$ which turns out to be asymptotically almost efficient with respect to the same risk function as that displayed on the right-hand side of (5). More precisely, this fact consists of proving the validity of the following identities (as $n \to +\infty$):

$$\mathbb{E}\left[\mathcal{L}_{\Theta,(X_1,\dots,X_n)}(T,\hat{T}_{n,m})\right] = \hat{\rho}_{0,m} + \frac{1}{n}\hat{\rho}_{1,m} + o\left(\frac{1}{n}\right) \tag{8}$$

$$\mathbb{E}\left[\mathcal{L}_{\Theta,(X_1,\dots,X_n)}(T,\tilde{T}_{n,m})\right] = \tilde{\rho}_{0,m} + \frac{1}{n}\tilde{\rho}_{1,m} + o\left(\frac{1}{n}\right),\tag{9}$$

along with $\hat{\rho}_{i,m} = \tilde{\rho}_{i,m}$ for $i = 0, 1$, where $\hat{T}_{n,m}$ is the same as in (5), for all prior distributions in a given class.

**Step 5**. After getting estimators $\tilde{T}_{n,m}$ as in **Step 4**, consider the *prior-free* estimators $\tilde{U}_{n,m}$ satisfying the analogous minimization problem as in (7), with $\hat{T}_{n,m}$ replaced by $\tilde{T}_{n,m}$.

**Step 6**. For any estimator $\tilde{U}_{n,m}$ found as in **Step 5**, prove the validity of the following identity (as $n \to +\infty$):

$$\mathbb{E}\left[\mathcal{L}_{\mathbb{U}}(U_{n,m}, \tilde{U}_{n,m})\right] = \tilde{R}_{0,m} + \frac{1}{n}\tilde{R}_{1,m} + o\left(\frac{1}{n}\right),\tag{10}$$

along with $\hat{R}_{i,m} = \tilde{R}_{i,m}$ for $i = 0, 1$, where the $\hat{R}_{i,m}$'s are the same as in (2), for all prior distributions in the same class as specified in **Step 4**. This last step shows why and how the frequentist (i.e., prior-free) estimator $\tilde{U}_{n,m}$ can be used, within the orthodox Bayesian framework, as a good approximation of the Bayesian estimator $\hat{U}_{n,m}$. This is particularly remarkable in at least two cases, which do not exclude each other: when the estimator $\tilde{T}_{n,m}$ obtained from **Step 4** is much simpler and numerically more manageable than $\hat{T}_{n,m}$; and when prior information suffices to characterize only a class of priors, but not a specific element of it.
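In a conjugate special case the whole six-step plan collapses to explicit formulas. The sketch below uses a Beta–Bernoulli model with $U_{n,m} = X_{n+1} + \cdots + X_{n+m}$ (all modelling and numerical choices are illustrative assumptions, not taken from the paper); with quadratic loss every barycenter is an ordinary mean, so the Bayesian estimator of Steps 1–3 and the prior-free plug-in of Steps 4–6 are both one-line expressions:

```python
import random

# Illustrative end-to-end sketch: T ~ Beta(a, b); given T, the X_i are
# i.i.d. Bernoulli(T); the quantity to predict is the number of successes
# among the next m observations.
rng = random.Random(42)
a, b, n, m = 2.0, 3.0, 50, 10

t = rng.betavariate(a, b)                        # random parameter T
xs = [1 if rng.random() < t else 0 for _ in range(n)]
s = sum(xs)

theta_bayes = (a + s) / (a + b + n)              # posterior mean of T
theta_free  = s / n                              # prior-free (MLE) estimate

u_hat   = m * theta_bayes                        # Bayesian estimator (Steps 1-3)
u_tilde = m * theta_free                         # prior-free plug-in (Steps 4-6)
```

As the sample size grows, the two parameter estimates, and hence the two predictive estimators, differ by at most a term of order $1/n$, which is the mechanism behind the asymptotic almost efficiency claimed in (10).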

This plan of action obeys the following principles:


The former principle, whose formalization constitutes the main novelty of this work, is concerned with the geometrical structure of the space of the parameters Θ. This is what we call a *relativistic principle* in point estimation theory: the goal of estimating a random quantity that depends on the observations (possibly besides the parameter) yields a modification of the geometry of Θ, to be now thought of as a curved space according to a non-trivial geometry. Of course, this modified geometry entails a coordinated notion of mean square error, now referred to the Riemannian geodesic distance. The term relativistic principle just hints at the original main principle of General Relativity Theory according to which the presence of a massive body modifies the geometry of the surrounding physical space, by means of the well-known Einstein tensor equations. These equations formalize a sort of compatibility between the physical and the geometric structures of the space. Accordingly, we will call the identities (43) and (44), stated in Section 3 to properly characterize the (Riemannian) metric on Θ, *compatibility equations*. Actually, the idea of metrizing the parameter space Θ in a non-standard way has been well known since the pioneering paper [19] by Radhakrishna Rao, and has received so much attention in the statistical literature that it has given birth to a fertile branch called *Information Geometry*. See, e.g., [20]. In particular, the concepts of efficiency, unbiasedness, Cramér–Rao lower bounds, Rao–Blackwell and Lehmann–Scheffé theorems are by far best understood in this non-standard (i.e., non-Euclidean) setting. See [21]. In any case, to the best of our knowledge, this is the first work which connects the use of a non-standard geometric setting on Θ with predictive estimation problems—even if some hints can be drawn from [22].
In our opinion, the lack of awareness about the aforesaid relativistic principle, combined with an abuse of the quadratic loss function on Θ, has produced a lot of actually sub-efficient algorithms, most of which focused on the estimation of certain probabilities, or of nonparametric objects. In these cases, the efficiency of the ensuing estimators is created artificially through a misuse of the quadratic loss, and it proves to be drastically downsized whenever these estimators are evaluated by means of other, more concrete loss functions which take account (as in (6)) of the natural geometry of the spaces of really observable quantities. To get an idea of this phenomenon, see the discussion about Robbins' estimators in Section 4.4 below.
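To make the discussion of non-Euclidean metrics on $\Theta$ concrete in the simplest possible case, the sketch below computes the geodesic (Fisher–Rao) distance on the Bernoulli parameter space $\Theta = (0,1)$, where the Riemannian metric induced by the Fisher information $\mathrm{I}(\theta) = 1/(\theta(1-\theta))$ yields the classical closed form $d(\theta_1, \theta_2) = 2\,|\arcsin\sqrt{\theta_2} - \arcsin\sqrt{\theta_1}|$. The model choice is purely illustrative and is not taken from the paper:

```python
from math import asin, sqrt

# Fisher-Rao geodesic distance on the Bernoulli parameter space (0,1):
# substituting theta = sin^2(phi) turns ds^2 = d(theta)^2 / (theta(1-theta))
# into 4 d(phi)^2, whence the arcsin formula below.
def fisher_rao_bernoulli(t1, t2):
    return 2.0 * abs(asin(sqrt(t2)) - asin(sqrt(t1)))

# Parameters at the same Euclidean distance 0.01 are much further apart
# in the information geometry when they sit near the boundary:
d_center = fisher_rao_bernoulli(0.50, 0.51)
d_edge   = fisher_rao_bernoulli(0.0001, 0.0101)
```

The boundary inflation visible here is exactly the kind of geometric effect that a naive quadratic loss on $\Theta$ ignores.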

## *1.2. Organization of the Paper*

We conclude the introduction by summarizing the main results of the paper, which are threefold. The first block of results, including Theorem 1, Proposition 1 and Lemma 1 in Section 2.2, concerns some refinement of de Finetti's Law of Large Numbers for the log-likelihood process. The second block of theoretical results, developed in Section 3, contains:


The last block of results, contained in Section 4, consists of explicit verifications of the compatibility equations for some simple statistical models (Sections 4.1–4.3), and also the adaptation of our plan of action to the same Poisson-mixture model used by Herbert Robbins in [23] to illustrate his empirical Bayes approach to predictive inference (Section 4.4). Finally, all the proofs of the theoretical results are deferred to Section 5, while some conclusions and future developments are hinted at in Section 6.

## **2. Technical Preliminaries**

We begin by rigorously fixing the mathematical setting, split into two subsections. The former will contain a very general framework which will serve to give a precise meaning to the questions presented in the Introduction and to state in full generality one of the main results, that is, Proposition 2 in Section 3. In fact, this statement will include some inequalities that, by carrying out the goal described in point (B) of the Introduction, will constitute the starting point for all the results presented in Section 3. The second subsection will deal with a simplification of the original setting—essentially based on additional regularity conditions for the spaces U and Θ and for the statistical model *μ*(·|·)—aimed at introducing the novel compatibility equations without too many technicalities.

## *2.1. The General Framework*

Let $(\mathbb{X}, \mathcal{X})$ and $(\Theta, \mathcal{T})$ be standard Borel spaces called *sample space* (for any single observation) and *parameter space*, respectively. Consider a sequence $\{X_i\}_{i \ge 1}$ of $\mathbb{X}$-valued random variables (r.v.'s, from now on) along with another $\Theta$-valued r.v. $T$, all the $X_i$'s and $T$ being defined on a suitable probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Assume that (4) holds for all $k \in \mathbb{N}$, $A_1, \dots, A_k \in \mathcal{X}$ and $\theta \in \Theta$ with some given probability kernel $\mu(\cdot \mid \cdot) : \mathcal{X} \times \Theta \to [0, 1]$, called *statistical model* (for any single observation). The validity of (4) entails that the $X_i$'s are *exchangeable* and that

$$\mathbb{P}[X_1 \in A_1, \dots, X_k \in A_k] = \int_{\Theta} \mu^{\otimes k} (A_1 \times \dots \times A_k \mid \theta)\, \pi(\mathrm{d}\theta) =: \alpha_k (A_1 \times \dots \times A_k) \tag{11}$$

holds for all $k \in \mathbb{N}$ and $A_1, \dots, A_k \in \mathcal{X}$ with some given probability measure (p.m.) $\pi$ on $(\Theta, \mathcal{T})$ called *prior distribution*. Identity (11) uniquely characterizes the p.m. $\alpha_k$ on $(\mathbb{X}^k, \mathcal{X}^k)$ for any $k \in \mathbb{N}$, this p.m. being called *law of $k$-observations*, where $\mathbb{X}^k$ ($\mathcal{X}^k$, respectively) denotes the $k$-fold Cartesian product ($\sigma$-algebra product, respectively) of $k$ copies of $\mathbb{X}$ ($\mathcal{X}$, respectively). Moreover, let

$$\begin{aligned} \pi_k(B \mid x_1, \dots, x_k) &:= \mathbb{P}[T \in B \mid X_1 = x_1, \dots, X_k = x_k] \\ \beta_k(A \mid x_1, \dots, x_k) &:= \mathbb{P}[X_{k+1} \in A \mid X_1 = x_1, \dots, X_k = x_k] \end{aligned}$$

be two probability kernels, with $\pi_k(\cdot \mid \cdot) : \mathcal{T} \times \mathbb{X}^k \to [0, 1]$ and $\beta_k(\cdot \mid \cdot) : \mathcal{X} \times \mathbb{X}^k \to [0, 1]$, defined as respective solutions of the following disintegration problems

$$\begin{aligned} \mathbb{P}[X_1 \in A_1, \dots, X_k \in A_k,\, T \in B] &= \int_{A_1 \times \dots \times A_k} \pi_k(B \mid x_1, \dots, x_k)\, \alpha_k(\mathrm{d}x_1 \dots \mathrm{d}x_k) \\ \mathbb{P}[X_1 \in A_1, \dots, X_k \in A_k,\, X_{k+1} \in A] &= \int_{A_1 \times \dots \times A_k} \beta_k(A \mid x_1, \dots, x_k)\, \alpha_k(\mathrm{d}x_1 \dots \mathrm{d}x_k) \end{aligned}$$

for any $k \in \mathbb{N}$, $A_1, \dots, A_k, A \in \mathcal{X}$ and $B \in \mathcal{T}$. The probability kernels $\pi_k(\cdot \mid \cdot)$ and $\beta_k(\cdot \mid \cdot)$ are called *posterior distribution* and *predictive distribution*, respectively.
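In a dominated discrete setting, the disintegrations above reduce to Bayes' formula for $\pi_k$ and to the posterior mixture of the model for $\beta_k$. A minimal sketch (the three-point parameter space and the Bernoulli model below are illustrative assumptions) also lets one check the chain rule against the joint law and the exchangeability of the first two observations:

```python
# Discrete Bayes kernels: pi_k(theta | x_1..x_k) is proportional to
# pi(theta) * prod_i mu(x_i | theta), and beta_k mixes the model over pi_k.
thetas = [0.1, 0.5, 0.9]
prior  = [0.2, 0.5, 0.3]

def lik(x, th):                  # mu({x} | theta): Bernoulli model
    return th if x == 1 else 1.0 - th

def marginal(xs):                # alpha_k of one trajectory
    total = 0.0
    for th, p in zip(thetas, prior):
        w = p
        for x in xs:
            w *= lik(x, th)
        total += w
    return total

def posterior(xs):               # pi_k(. | x_1,...,x_k) via Bayes' formula
    ws = []
    for th, p in zip(thetas, prior):
        w = p
        for x in xs:
            w *= lik(x, th)
        ws.append(w)
    z = sum(ws)
    return [w / z for w in ws]

def predictive(x_next, xs):      # beta_k({x_next} | x_1,...,x_k)
    return sum(q * lik(x_next, th) for th, q in zip(thetas, posterior(xs)))

# chain-rule consistency: alpha_2({(1,1)}) = alpha_1({1}) * beta_1({1} | 1)
joint = marginal([1, 1])
chain = marginal([1]) * predictive(1, [1])

# exchangeability: the two orderings of one success and one failure
# receive the same mass under the joint law
p10 = marginal([1]) * predictive(0, [1])
p01 = marginal([0]) * predictive(1, [0])
```

Both checks hold up to floating-point rounding, in agreement with the disintegration identities displayed above.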

Let $(\mathbb{U}, d_{\mathbb{U}})$ be a Polish metric space and, for fixed $n, m \in \mathbb{N}$, let $u_{n,m} : \mathbb{X}^m \times \mathbb{X}^n \times \Theta \to \mathbb{U}$ be a measurable map. Let $U_{n,m} := u_{n,m}(X_{n+1}, \dots, X_{n+m}; X_1, \dots, X_n; T)$ be the random quantity to be estimated with respect to the loss function $\mathcal{L}_{\mathbb{U}}(u, v) := d^2_{\mathbb{U}}(u, v)$. Now, recall the notion of *barycenter* (also known as Fréchet mean) of a given p.m. Let $(S, d_S)$ be a Polish metric space, endowed with its Borel $\sigma$-algebra $\mathcal{B}(S)$. Given a p.m. $\mu$ on $(S, \mathcal{B}(S))$, define

$$\mathrm{Bary}_S[\mu; d_S] := \operatorname{Argmin}_{y \in S} \int_S d_S^2(x, y)\,\mu(\mathrm{d}x)$$

provided that $\mu$ has finite second moment ($\mu \in \mathcal{P}_2(S, d_S)$, in symbols) and that at least one minimum point exists. See [24–26] for results on existence, uniqueness and some characterizations of barycenters. Then, put

$$\rho_{n,m}(C \mid x_1, \dots, x_n) := \mathbb{P}[U_{n,m} \in C \mid X_1 = x_1, \dots, X_n = x_n],$$

meaning that *<sup>ρ</sup>n*,*m*(·|·) : <sup>B</sup>(U) <sup>×</sup> <sup>X</sup>*<sup>n</sup>* <sup>→</sup> [0, 1] is a probability kernel that solves the disintegration problem

$$\mathbb{P}[X_1 \in A_1, \dots, X_n \in A_n,\, U_{n,m} \in C] = \int_{A_1 \times \dots \times A_n} \rho_{n,m}(C \mid x_1, \dots, x_n)\, \alpha_n(\mathrm{d}x_1 \dots \mathrm{d}x_n)$$

for any *<sup>A</sup>*1, ... , *An* <sup>∈</sup> <sup>X</sup> and *<sup>C</sup>* <sup>∈</sup> <sup>B</sup>(U). If <sup>E</sup>[d2 <sup>U</sup>(*Un*,*m*, *u*0)] < +∞ for some *u*<sup>0</sup> ∈ U and BaryU[*ρn*,*m*(· | *x*1,..., *xn*); dU] exists uniquely for *αn*-almost all (*x*1,..., *xn*), then

$$\hat{U}_{n,m} = \mathrm{Bary}_{\mathbb{U}}[\rho_{n,m}(\cdot \mid X_1, \dots, X_n); d_{\mathbb{U}}] \tag{12}$$

solves the minimization problem (1). To give an analogous formalization to the minimization problem (7), define

$$\gamma_{\theta,(x_1,\dots,x_n)}(C) := \mu^{\otimes m} \left( \{ (y_1,\dots,y_m) \in \mathbb{X}^m \mid u_{n,m}(y_1,\dots,y_m; x_1,\dots,x_n; \theta) \in C \} \,\middle|\, \theta \right)$$

for any *<sup>θ</sup>* <sup>∈</sup> <sup>Θ</sup>, (*x*1, ... , *xn*) <sup>∈</sup> <sup>X</sup>*<sup>n</sup>* and *<sup>C</sup>* <sup>∈</sup> <sup>B</sup>(U). Again, if *γθ*,(*x*1,...,*xn*) ∈ P2(U, dU) and BaryU[*γθ*,(*x*1,...,*xn*)(·); dU] exists uniquely for any *θ* ∈ Θ and *αn*-almost all (*x*1,..., *xn*), then

$$U^*_{n,m} = \mathrm{Bary}_{\mathbb{U}}[\gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}; d_{\mathbb{U}}] \tag{13}$$

solves the minimization problem (7). By the way, notice that a combination of de Finetti's representation theorem with basic properties of conditional distributions entails that

$$\rho_{n,m}(C \mid x_1, \dots, x_n) = \int_{\Theta} \gamma_{\theta, (x_1, \dots, x_n)}(C) \, \pi_n(\mathrm{d}\theta \mid x_1, \dots, x_n) \tag{14}$$

for $\alpha_n$-almost all $(x_1, \dots, x_n)$. It remains to formalize the minimization problem (5). If $\gamma_{\theta,(x_1,\dots,x_n)}, \gamma_{\tau,(x_1,\dots,x_n)} \in \mathcal{P}_2(\mathbb{U}, d_{\mathbb{U}})$, then the loss function in (6) satisfies

$$\mathcal{L}_{\Theta,(x_1,\dots,x_n)}(\theta,\tau) = \mathcal{W}_{\mathbb{U}}^2\big(\gamma_{\theta,(x_1,\dots,x_n)}, \gamma_{\tau,(x_1,\dots,x_n)}\big)\,,$$

where $\mathcal{W}_{\mathbb{U}}$ denotes the 2-Wasserstein distance on $\mathcal{P}_2(\mathbb{U}, d_{\mathbb{U}})$. See [27] (Chapters 6–7) for more information on the Wasserstein distance. Therefore, if $\pi_n(\cdot \mid x_1, \dots, x_n) \in \mathcal{P}_2(\Theta, \mathcal{L}^{1/2}_{\Theta,(x_1,\dots,x_n)})$ and $\mathrm{Bary}_{\Theta}[\pi_n(\cdot \mid x_1, \dots, x_n); \mathcal{L}^{1/2}_{\Theta,(x_1,\dots,x_n)}]$ exists uniquely for $\alpha_n$-almost all $(x_1, \dots, x_n)$, then

$$\hat{T}_{n,m} = \mathrm{Bary}_{\Theta} \Big[ \pi_n(\cdot \mid X_1, \dots, X_n); \mathcal{L}^{1/2}_{\Theta, (X_1, \dots, X_n)} \Big] \tag{15}$$

solves the minimization problem (5).
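The barycenters appearing in (12), (13) and (15) can, in simple cases, be computed by direct minimization of the Fréchet functional. The sketch below (a two-point measure on the circle with arc-length distance; the example is purely illustrative) does exactly that, and shows that on a curved space the Fréchet mean can be far from the naive arithmetic mean:

```python
from math import pi

# Arc-length (geodesic) distance on the circle of circumference 2*pi.
def arc_dist(x, y):
    d = abs(x - y) % (2 * pi)
    return min(d, 2 * pi - d)

# Brute-force barycenter: minimise y -> sum_x d(x, y)^2 over a fine grid,
# i.e. the Fréchet functional for the uniform measure on the points.
def frechet_mean_circle(points, grid_size=20000):
    grid = [2 * pi * k / grid_size for k in range(grid_size)]
    return min(grid, key=lambda y: sum(arc_dist(x, y) ** 2 for x in points))

# Two points straddling the cut at 0: their arithmetic mean is pi, but the
# Fréchet mean for the arc distance sits at (or near) 0.
pts = [0.1, 2 * pi - 0.1]
bary = frechet_mean_circle(pts)
```

The brute-force approach scales poorly, but it makes the definition of $\mathrm{Bary}_S[\mu; d_S]$ completely concrete.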

To conclude, it remains to formalize the definitions of the various Bayesian risk functions that will appear in the formulation of the main results. For any estimator $U^{\dagger}_{n,m} = u^{\dagger}_{n,m}(X_1, \dots, X_n)$ of $U_{n,m}$, obtained with a measurable $u^{\dagger}_{n,m} : \mathbb{X}^n \to \mathbb{U}$, put

$$\begin{split} \mathcal{R}_{\mathbb{U}}[U^{\dagger}_{n,m}] &:= \mathbb{E}\Big[\mathcal{L}_{\mathbb{U}}(U_{n,m}, U^{\dagger}_{n,m})\Big] \\ &= \int_{\Theta} \int_{\mathbb{X}^{n+m}} \mathcal{L}_{\mathbb{U}}\Big(u_{n,m}(\mathbf{y}; \mathbf{x}; \theta), u^{\dagger}_{n,m}(\mathbf{x})\Big)\, \mu^{\otimes (n+m)}(\mathrm{d}\mathbf{y}\,\mathrm{d}\mathbf{x} \mid \theta)\, \pi(\mathrm{d}\theta) \\ &= \int_{\mathbb{X}^{n}} \int_{\Theta} \int_{\mathbb{X}^{m}} \mathcal{L}_{\mathbb{U}}\Big(u_{n,m}(\mathbf{y}; \mathbf{x}; \theta), u^{\dagger}_{n,m}(\mathbf{x})\Big)\, \mu^{\otimes m}(\mathrm{d}\mathbf{y} \mid \theta)\, \pi_{n}(\mathrm{d}\theta \mid \mathbf{x})\, \alpha_{n}(\mathrm{d}\mathbf{x}) \end{split} \tag{16}$$

provided that the integrals are finite. Here and throughout, the bold symbols $\mathbf{x}$, $\mathbf{y}$ are just shorthands for the vectors $(x_1, \dots, x_n)$ and $(y_1, \dots, y_m)$, respectively. Analogously, for any estimator $T^{\dagger}_{n,m} = t^{\dagger}_{n,m}(X_1, \dots, X_n)$ of $T$, obtained with a measurable $t^{\dagger}_{n,m} : \mathbb{X}^n \to \Theta$, put

$$\begin{split} \mathcal{R}_{\Theta}[T^{\dagger}_{n,m}] &:= \mathbb{E}\left[\mathcal{L}_{\Theta,(X_1,\dots,X_n)}(T, T^{\dagger}_{n,m})\right] \\ &= \int_{\Theta} \int_{\mathbb{X}^{n}} \mathcal{L}_{\Theta,\mathbf{x}}\Big(\theta, t^{\dagger}_{n,m}(\mathbf{x})\Big)\, \mu^{\otimes n}(\mathrm{d}\mathbf{x} \mid \theta)\, \pi(\mathrm{d}\theta) \\ &= \int_{\mathbb{X}^{n}} \int_{\Theta} \mathcal{L}_{\Theta,\mathbf{x}}\Big(\theta, t^{\dagger}_{n,m}(\mathbf{x})\Big)\, \pi_{n}(\mathrm{d}\theta \mid \mathbf{x})\, \alpha_{n}(\mathrm{d}\mathbf{x}) \end{split} \tag{17}$$

provided that the integrals are finite.

## *2.2. The Simplified Framework*

Start by assuming that $\mathbb{U} = \mathbb{R}$ and $\mathcal{L}_{\mathbb{U}}(u, v) = |u - v|^2$. Then, restrict the attention to those predictive problems in which the quantity to be estimated depends only on the new observations $X_{n+1}, \dots, X_{n+m}$ and on the random parameter $T$, but not on the observable variables $X_1, \dots, X_n$. This restriction is not actually conceptual, and is made only to diminish the mathematical complexity of the ensuing asymptotic expansions (valid as $n \to +\infty$), so that there are fewer sources of dependence on the variable $n$. Thus, the quantity to be estimated has the form $u_m(X_{n+1}, \dots, X_{n+m}; T)$ for some measurable $u_m : \mathbb{X}^m \times \Theta \to \mathbb{R}$. From now on, it will be assumed that

$$\mathbb{E}\left[\left(u_m(X_{n+1},\ldots,X_{n+m};T)\right)^2\right] < +\infty\,. \tag{18}$$

Whence, existence and uniqueness of the Bayesian estimator $\hat{U}_{n,m}$ in (12) are well-known: its explicit form is given by $\hat{U}_{n,m} = \hat{u}_{n,m}(X_1,\dots,X_n)$ with

$$\begin{split} \hat{u}_{n,m}(x_{1},\ldots,x_{n}) &= \mathbb{E}[u_{m}(X_{n+1},\ldots,X_{n+m};T) \mid X_{1} = x_{1},\ldots,X_{n} = x_{n}] \\ &= \int_{\Theta} \int_{\mathcal{X}^{m}} u_{m}(y_{1},\ldots,y_{m};\theta)\, \mu^{\otimes m}(\mathrm{d}y_{1}\ldots\mathrm{d}y_{m} \mid \theta)\, \pi_{n}(\mathrm{d}\theta \mid x_{1},\ldots,x_{n})\,, \end{split}$$

which is finite for $\alpha_n$-almost all $(x_1, \dots, x_n)$. The risk function $\mathfrak{R}_{\mathbb{U}}$ evaluated at $\hat{U}_{n,m}$ achieves its overall minimum value and, from (16), it takes the form:

$$\mathfrak{R}_{\mathbb{U}}[\hat{U}_{n,m}] = \int_{\mathcal{X}^{n}} \left\{ \int_{\Theta} \mathrm{v}(\theta)\, \pi_{n}(\mathrm{d}\theta \mid \mathbf{x}) + \int_{\Theta} [\mathrm{m}(\theta)]^2\, \pi_{n}(\mathrm{d}\theta \mid \mathbf{x}) - \left( \int_{\Theta} \mathrm{m}(\theta)\, \pi_{n}(\mathrm{d}\theta \mid \mathbf{x}) \right)^2 \right\} \alpha_{n}(\mathrm{d}\mathbf{x}), \tag{19}$$

with

$$\begin{aligned} \mathrm{m}(\theta) &:= \int_{\mathcal{X}^{m}} u_{m}(y_1, \dots, y_{m}; \theta)\, \mu^{\otimes m}(\mathrm{d}y_1 \dots \mathrm{d}y_{m} \mid \theta) \\ \mathrm{v}(\theta) &:= \int_{\mathcal{X}^{m}} [u_{m}(y_1, \dots, y_{m}; \theta) - \mathrm{m}(\theta)]^2\, \mu^{\otimes m}(\mathrm{d}y_1 \dots \mathrm{d}y_{m} \mid \theta) \end{aligned}$$

thanks to the well-known "Law of Total Variance". See, e.g., [28] [Problem 34.10(b)]. As to the issue of estimating $T$, the first remarkable simplification induced by the above assumptions is that the p.m. $\gamma_{\theta,(x_1,\dots,x_n)}$ is independent of $(x_1,\dots,x_n)$. Whence,

$$\Delta(\theta,\tau) := [\mathcal{L}_{\Theta,(x_1,\ldots,x_n)}(\theta,\tau)]^{1/2} = \mathcal{W}_{\mathbb{U}}\left(\gamma_{\theta,(x_1,\ldots,x_n)};\gamma_{\tau,(x_1,\ldots,x_n)}\right),\tag{20}$$

is, in turn, independent of $(x_1,\dots,x_n)$ and *defines a distance* on $\Theta$ provided that

$$\gamma_{\theta,(x_1,\dots,x_n)} = \gamma_{\tau,(x_1,\dots,x_n)}$$

entails $\theta = \tau$. Thus, for any estimator $T^\dagger_{n,m} = t^\dagger_{n,m}(X_1, \dots, X_n)$ of $T$, obtained with a measurable $t^\dagger_{n,m} : \mathcal{X}^n \to \Theta$, (17) becomes

$$\mathfrak{R}_{\Theta}[T_{n,m}^{\dagger}] = \int_{\mathcal{X}^{n}} \int_{\Theta} [\Delta(\theta, t_{n,m}^{\dagger}(\mathbf{x}))]^{2}\, \pi_{n}(\mathrm{d}\theta \mid \mathbf{x})\, \alpha_{n}(\mathrm{d}\mathbf{x}) \,. \tag{21}$$
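To make the decomposition behind (19) concrete, it can be verified by Monte Carlo in a toy conjugate instance of the simplified framework (all modeling choices below are ours, for illustration only): Bernoulli observations with a uniform prior, the quantity to predict being the average of $m$ future observations, so that $\mathrm{m}(\theta) = \theta$ and $\mathrm{v}(\theta) = \theta(1-\theta)/m$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance (ours, not from the paper): Bernoulli model, Beta(1,1) prior,
# quantity to estimate U = (X_{n+1} + ... + X_{n+m}) / m, so that
# m(theta) = theta and v(theta) = theta*(1 - theta)/m.
n, m, reps = 20, 5, 200_000

T = rng.uniform(size=reps)                       # T ~ pi
S = rng.binomial(n, T)                           # sufficient statistic of X_1..X_n
U = rng.binomial(m, T) / m                       # future average to be predicted

a, b = 1 + S, 1 + n - S                          # Beta posterior parameters
U_hat = a / (a + b)                              # Bayes estimator = posterior mean of theta

mse = np.mean((U_hat - U) ** 2)

# Risk predicted by (19): average of E_post[v(theta)] + Var_post[m(theta)]
post_var = a * b / ((a + b) ** 2 * (a + b + 1))  # posterior variance of theta
post_Ev = (a * b / ((a + b) * (a + b + 1))) / m  # E_post[theta*(1 - theta)] / m
risk_19 = np.mean(post_Ev + post_var)

print(mse, risk_19)                              # agree up to Monte Carlo error
```

The agreement of the two printed numbers (up to Monte Carlo error) reflects precisely the law-of-total-variance decomposition that underlies (19).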

The last simplifications concern the basic objects of the inference, i.e., the statistical model $\mu(\cdot\mid\cdot)$ and the prior $\pi$. First, assume that $\Theta = (a, b) \subseteq \mathbb{R}$ and that $\pi$ has a density $p$ (with respect to the Lebesgue measure). Even if this one-dimensionality assumption can seem a drastic simplification, it is again of a non-conceptual nature, and it is made to diminish the mathematical complexity of the ensuing statements. In fact, one of the goals of this work is to provide a Riemannian-like characterization of the metric space $(\Theta, \Delta)$, and this is particularly simple in such a one-dimensional setting. The following arguments should be quite easily reproduced at least in a finite-dimensional setting (i.e., when $\Theta \subseteq \mathbb{R}^d$) by using basic tools of Riemannian geometry, such as local expansions of the geodesic distance. See, e.g., [29] [Chapter 5]. As to the statistical model $\mu(\cdot\mid\cdot)$, consider the following:

**Assumption 1.** *μ*(·|·) *is dominated by some σ-finite measure χ on* (X, X ) *with a (distinguished version of the) density f*(·|*θ*) *that satisfies:*


$$\mathrm{K}(t \parallel \theta) := \int_{\mathcal{X}} \log\left( \frac{f(x \mid t)}{f(x \mid \theta)} \right) \mu(\mathrm{d}x \mid t) \tag{22}$$

*is well-defined.*

A canonical choice for the Hilbert space $\mathcal{H}$ is in the form of a *weighted Sobolev space* $H^r(\Theta; \pi)$ for some $r \geq 1$. See, e.g., [30,31] for the definition and further properties of weighted Sobolev spaces, such as embedding theorems. By the way, it is worth remarking that such assumptions are made to easily state the following results. It is plausible that they could be relaxed in future works.

In this regularity setting, introduce the sequence {*Hn*}*n*≥1, where *Hn* : Ω → H represents the (normalized) *log-likelihood process*, that is

$$H_{n} := \frac{1}{n}\, \ell_{n}(\cdot; X_1, \dots, X_{n}) := \frac{1}{n} \sum_{i=1}^{n} \log f(X_i \mid \cdot) = \int_{\mathcal{X}} \log f(\xi \mid \cdot)\, \mathfrak{e}_n^{(X_1, \dots, X_{n})}(\mathrm{d}\xi) \tag{23}$$

the symbol $\mathfrak{e}_n^{(X_1,\dots,X_n)}$ standing for the *empirical measure* based on $(X_1,\dots,X_n)$, i.e.,

$$\mathfrak{e}_n^{(X_1,\dots,X_n)} := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}\,.$$

For completeness, any notation like $\ell_n(\cdot; X_1, \dots, X_n)$ is just a short-hand to denote the entire function $\theta \mapsto \ell_n(\theta; X_1, \dots, X_n)$. First of all, observe that $H_n$ is a *sufficient statistic* in both the classical and the Bayesian sense. See [11]. Then, a version of de Finetti's Law of Large Numbers (see [9,32]) for the log-likelihood process can be stated as follows:

**Theorem 1.** *Under Assumption 1, define the following* H*-valued r.v.*

$$H := \int\_{\mathcal{X}} \log f(z \mid \cdot) \mu(\mathrm{d}z \mid T) = -\mathsf{K}(T \parallel \cdot) + \int\_{\mathcal{X}} \log f(z \mid T) \mu(\mathrm{d}z \mid T)$$

*along with <sup>ν</sup>n*(*D*) :<sup>=</sup> <sup>P</sup>[*Hn* <sup>∈</sup> *<sup>D</sup>*] *and <sup>ν</sup>*(*D*) :<sup>=</sup> <sup>P</sup>[*<sup>H</sup>* <sup>∈</sup> *<sup>D</sup>*]*, for any <sup>D</sup>* <sup>∈</sup> <sup>B</sup>(H)*. Then, it holds that*

$$H\_n \xrightarrow{L^2} H \tag{24}$$

*which, in turn, yields that ν<sup>n</sup>* ⇒ *ν, where* ⇒ *denotes weak convergence of p.m.'s on* (H, B(H))*.*
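Theorem 1 can be illustrated numerically in a toy instance (the model and prior below are our illustrative choices, not taken from the paper): for the Gaussian model $\mu(\cdot \mid \theta) = N(\theta, 1)$, the limit is $H(\theta) = -\frac{1}{2}[(T-\theta)^2 + 1] - \frac{1}{2}\log(2\pi)$, and a grid surrogate of the $\mathcal{H}$-norm of $H_n - H$ shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (ours): mu(.|theta) = N(theta, 1), T drawn from a N(0,1) prior.
T = rng.normal()
thetas = np.linspace(T - 3.0, T + 3.0, 201)   # grid on Theta for a function norm
dtheta = thetas[1] - thetas[0]

# Limit H(theta) = int log f(z|theta) mu(dz|T) = -((T - theta)^2 + 1)/2 - log(2*pi)/2
H = -0.5 * ((T - thetas) ** 2 + 1.0) - 0.5 * np.log(2.0 * np.pi)

X = rng.normal(T, 1.0, size=10_000)
errs = []
for n in (100, 1_000, 10_000):
    # H_n(theta) = (1/n) sum_i log f(X_i|theta): the empirical-measure average in (23)
    Hn = (-0.5 * (X[:n, None] - thetas[None, :]) ** 2 - 0.5 * np.log(2.0 * np.pi)).mean(axis=0)
    errs.append(np.sqrt(np.sum((Hn - H) ** 2) * dtheta))   # grid surrogate for an L^2 norm
    print(n, errs[-1])
```

The printed errors decay roughly like $n^{-1/2}$, consistent with the $L^2$ convergence in (24).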

Then, to carry out the objectives mentioned in the Introduction, a quantitative refinement of the thesis *ν<sup>n</sup>* ⇒ *ν* is needed, as stated in the following proposition.

**Proposition 1.** *Let C*<sup>2</sup> *<sup>b</sup>* (H) *denote the space of bounded,* <sup>C</sup><sup>2</sup> *functionals on* <sup>H</sup>*. Besides Assumption 1, suppose there exists a function* Γ(·; *μ*, *π*) : H → R *such that*

$$\frac{1}{2}\,\mathbb{E}\left[\mathrm{Hess}[\Psi]_{H}\otimes\mathrm{Cov}_{T}[\log f(X_{i}\mid\cdot)]\right] = \mathbb{E}[\Psi(H)\,\Gamma(H;\mu,\pi)]\tag{25}$$

*holds for all functionals* $\Psi \in C^2_b(\mathcal{H})$*, where* $\mathrm{Hess}[\Psi]_h$ *denotes the Hessian of* $\Psi$ *at* $h \in \mathcal{H}$*,* $\otimes$ *is the tensor product between quadratic forms (operators), and* $\mathrm{Cov}_t[\log f(X_i \mid \cdot)]$ *stands for the covariance operator of the* $\mathcal{H}$*-valued r.v.* $\log f(X_i \mid \cdot)$ *with respect to the p.m.* $\mu(\cdot \mid t)$*. Then,*

$$\int_{\mathcal{H}} \Psi(h)\, \nu_{n}(\mathrm{d}h) = \int_{\mathcal{H}} \Psi(h)\, \nu(\mathrm{d}h) + \frac{1}{n} \int_{\mathcal{H}} \Psi(h)\, \Gamma(h; \mu, \pi)\, \nu(\mathrm{d}h) + o\left(\frac{1}{n}\right) \tag{26}$$

*holds as n* → +∞ *for all continuous* Ψ : H → R *for which the above integrals are convergent.*

For further information on second-order differentiability in Hilbert/Banach spaces, see [33,34]. By the way, the above identity (26) is a quantitative strengthening of de Finetti's theorem, similar to the identities stated in Theorem 1.1 of [8] [Chapter 6], valid in a finite-dimensional setting. Later on, we will resort to *uniform* versions of (26), meaning that the $o(\frac{1}{n})$-term is uniformly bounded with respect to $h$. However, such results—much more in the spirit of the Central Limit Theorem—are very difficult to prove and, to the best of the author's knowledge, there are no known results in infinite dimensions. Examples in finite-dimensional settings are given in [35,36], which prove Berry–Esseen-like inequalities in the very specific context of Bernoulli r.v.'s. See also [37]. Anyway, since one merit of [35] is to show how to use the classical Central Limit Theorem to prove an expansion as in (26), one could hope to follow that very same line of reasoning by resorting to some version of the Central Limit Theorem for Banach spaces, such as that stated in [38]. Research on this is ongoing.

Now, to make the above Proposition 1 a bit more concrete, it is worth noticing the case in which *f*(·|*θ*) is in exponential form. In fact, in this case, the identity (26) can be rewritten in a simpler form, condensed in the following statement.

**Lemma 1.** *Besides Assumption 1, suppose that* $f(x \mid \theta) = \exp\{\theta S(x) - M(\theta)\}$*, with some measurable* $S : \mathcal{X} \to \mathbb{R}$ *and* $M(\theta) := \log \int_{\mathcal{X}} e^{\theta S(x)}\chi(\mathrm{d}x) \in \mathbb{R}$ *for all* $\theta \in \Theta$*. Then,* (26) *holds with*

$$\nu(D) := \mathbb{P}\left[\left(\theta \mapsto \theta M'(T) - M(\theta)\right) \in D\right]$$

*and*

$$\Gamma\left( (\theta \mapsto \theta M'(t) - M(\theta)); \mu, \pi\right) = \frac{M''(t)}{p(t)} \frac{\mathrm{d}^2}{\mathrm{d}y^2} \left[ M''(V(y))\, p(V(y))\, V'(y) \right]_{\Big| y = M'(t)}$$

*where* $V(M'(t)) = t$ *for any* $t \in \Theta$*.*
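As a sanity check of Lemma 1, the displayed formula for $\Gamma$ can be evaluated symbolically in a concrete instance (our choice, purely illustrative): the Gaussian natural exponential family with $S(x) = x$ and $M(\theta) = \theta^2/2$, so that $M'(\theta) = \theta$, $V(y) = y$, and we take a standard normal prior density $p$.

```python
import sympy as sp

theta, t, y = sp.symbols('theta t y', real=True)

# Illustrative instance (ours): M(theta) = theta**2/2, so M'(theta) = theta and V(y) = y.
M = theta ** 2 / 2
Mp = sp.diff(M, theta)           # M'(theta) = theta
Mpp = sp.diff(M, theta, 2)       # M''(theta) = 1
V = y                            # inverse of M': V(M'(t)) = t holds trivially here

# Standard normal prior density p (an illustrative choice, not from the paper)
p = sp.exp(-theta ** 2 / 2) / sp.sqrt(2 * sp.pi)

# Bracketed function M''(V(y)) * p(V(y)) * V'(y) from the displayed formula for Gamma
inner = Mpp.subs(theta, V) * p.subs(theta, V) * sp.diff(V, y)
Gamma = (Mpp.subs(theta, t) / p.subs(theta, t)) \
    * sp.diff(inner, y, 2).subs(y, Mp.subs(theta, t))
print(sp.simplify(Gamma))        # simplifies to t**2 - 1 for this choice of p
```

For this choice the bracketed function reduces to $p(y)$ itself, and $\Gamma$ simplifies to $p''(t)/p(t) = t^2 - 1$.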

To conclude this subsection, consider the expressions (19)–(21) and notice that they depend explicitly on the posterior distribution $\pi_n(\cdot \mid x_1, \dots, x_n)$. Now, thanks to Assumption 1, the mapping $t \mapsto \delta_t$ can be seen as defined on $\Theta$ and taking values in the dual space $\mathcal{H}^*$, with Riesz representative $\mathfrak{h}_t \in \mathcal{H}$. More formally, for any $h \in \mathcal{H}$ and $t \in \Theta$, it holds that $h(t) = {}_{\mathcal{H}}\langle h, \delta_t\rangle_{\mathcal{H}^*} = \langle h, \mathfrak{h}_t\rangle$, where $\langle\cdot,\cdot\rangle$ stands for the scalar product on $\mathcal{H}$ while ${}_{\mathcal{H}}\langle\cdot,\cdot\rangle_{\mathcal{H}^*}$ denotes the pairing between $\mathcal{H}$ and $\mathcal{H}^*$. In this notation, the posterior distribution can be rewritten in exponential form as:

$$\pi\_n(B \mid X\_1, \dots, X\_n) = \frac{\int\_B \exp\{n \langle H\_n, \mathfrak{h}\_\theta \rangle\} \pi(\mathbf{d}\theta)}{\int\_\Theta \exp\{n \langle H\_n, \mathfrak{h}\_\theta \rangle\} \pi(\mathbf{d}\theta)} = \pi\_n^\*(B \mid H\_n) \tag{27}$$

for any $B \in \mathcal{T}$, the probability kernel $\pi^*_n(\cdot\mid\cdot) : \mathcal{T} \times \mathcal{H} \to [0, 1]$ being defined by

$$\pi\_n^\*(B \mid h) := \frac{\int\_B \exp\{n \langle h, \mathfrak{h}\_\theta \rangle\} \pi(\mathbf{d}\theta)}{\int\_\Theta \exp\{n \langle h, \mathfrak{h}\_\theta \rangle\} \pi(\mathbf{d}\theta)} \,. \tag{28}$$

This is particularly interesting because it shows that, in the presence of a dominated statistical model characterized by strictly positive, smooth densities, the posterior distribution can always be thought of as an element of an exponential family, even if the original statistical model $\mu(\cdot\mid\cdot)$ is not in exponential form. By utilizing the kernel $\pi^*_n$ in combination with the p.m. $\nu_n$, the following re-writings of (19)–(21) are valid:

$$\begin{split} \mathfrak{R}_{\mathbb{U}}[\hat{U}_{n,m}] = \int_{\mathcal{H}} \bigg\{ \int_{\Theta} \mathrm{v}(\theta)\, \pi_{n}^{*}(\mathrm{d}\theta \mid h) &+ \int_{\Theta} [\mathrm{m}(\theta)]^{2}\, \pi_{n}^{*}(\mathrm{d}\theta \mid h) \\ &- \left( \int_{\Theta} \mathrm{m}(\theta)\, \pi_{n}^{*}(\mathrm{d}\theta \mid h) \right)^{2} \bigg\}\, \nu_{n}(\mathrm{d}h) \end{split} \tag{29}$$

$$\mathfrak{R}_{\Theta}[T_{n,m}^{\dagger}] = \int_{\mathcal{H}} \int_{\Theta} [\Delta(\theta, \mathcal{T}_{n,m}^{\dagger}(h))]^2\, \pi_n^*(\mathrm{d}\theta \mid h)\, \nu_n(\mathrm{d}h),\tag{30}$$

where the mapping $\mathcal{T}^\dagger_{n,m}$ is such that $\mathcal{T}^\dagger_{n,m}(H_n) = t^\dagger_{n,m}(X_1,\dots,X_n)$ holds $\mathbb{P}$-a.s.
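The rewriting (27) can also be illustrated numerically (the Gaussian model and normal prior below are our illustrative choices): since $n\langle H_n, \mathfrak{h}_\theta\rangle = \sum_i \log f(X_i \mid \theta)$, computing the posterior through the log-likelihood process returns exactly the usual Bayes posterior.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative choices (ours): Gaussian model N(theta, 1), standard normal prior,
# posterior evaluated on a grid over Theta.
n = 50
X = rng.normal(1.0, 1.0, size=n)
theta = np.linspace(-3.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
prior = np.exp(-theta ** 2 / 2)

# Route 1: via the log-likelihood process, as in (27): n*H_n(theta), up to a constant
nHn = np.array([-0.5 * np.sum((X - th) ** 2) for th in theta])
w = np.exp(nHn - nHn.max()) * prior
post_from_Hn = w / (w.sum() * dtheta)

# Route 2: direct Bayes computation from the product of densities
loglik = -0.5 * ((X[:, None] - theta[None, :]) ** 2).sum(axis=0)
w2 = np.exp(loglik - loglik.max()) * prior
post_direct = w2 / (w2.sum() * dtheta)

print(np.max(np.abs(post_from_Hn - post_direct)))
```

The two routes produce the same posterior density up to floating-point error, which is all that (27) asserts in this dominated setting.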

#### **3. Main Results**

The first result establishes a relationship between the Bayesian risk functions $\mathfrak{R}_{\mathbb{U}}$ and $\mathfrak{R}_\Theta$ defined in (16) and (17), respectively. Due to the central role of this relationship, it will be formulated within the general framework described in Section 2.1.

**Proposition 2.** *Consider any estimator* $U^\dagger_{n,m} = u^\dagger_{n,m}(X_1, \dots, X_n)$ *of* $U_{n,m}$ *and any estimator* $T^\dagger_{n,m} = t^\dagger_{n,m}(X_1, \dots, X_n)$ *of* $T$ *such that* $\mathbb{E}[\mathrm{d}^2_{\mathbb{U}}(U^\dagger_{n,m}, u_0)] < +\infty$ *holds for some* $u_0 \in \mathbb{U}$*, along with* $\mathbb{E}\big[\mathcal{L}_{\Theta,(X_1,\dots,X_n)}(T^\dagger_{n,m}, t_0)\big] < +\infty$ *for some* $t_0 \in \Theta$*. Then, it holds*

$$\begin{split} \mathfrak{R}_{\mathbb{U}}[U^{\dagger}_{n,m}] &\leq \mathfrak{R}_{\Theta}[T^{\dagger}_{n,m}] + \mathbb{E}\left[\int_{\mathbb{U}} \mathrm{d}^{2}_{\mathbb{U}}(U^{\dagger}_{n,m},u)\, \gamma_{T^{\dagger}_{n,m},(X_{1},\ldots,X_{n})}(\mathrm{d}u)\right] \\ &\quad+ 2\,\mathbb{E}\left[\mathcal{L}^{1/2}_{\Theta,(X_{1},\ldots,X_{n})}(T,T^{\dagger}_{n,m}) \left(\int_{\mathbb{U}} \mathrm{d}^{2}_{\mathbb{U}}(U^{\dagger}_{n,m},u)\, \gamma_{T^{\dagger}_{n,m},(X_{1},\ldots,X_{n})}(\mathrm{d}u)\right)^{1/2}\right]. \end{split} \tag{31}$$

*In particular, if the Bayesian risk function* $\mathfrak{R}_\Theta$ *is optimized by choosing* $T^\dagger_{n,m} = \hat{T}_{n,m}$*, where* $\hat{T}_{n,m}$ *is as in* (15)*, and* $U^\dagger_{n,m}$ *is chosen equal to* $U^*_{n,m}$*, where* $U^*_{n,m}$ *is as in* (13)*, then* (31) *becomes*

$$\begin{split} \mathfrak{R}_{\mathbb{U}}[U^*_{n,m}] &\leq \mathfrak{R}_{\Theta}[\hat{T}_{n,m}] + \mathbb{E}\left[\int_{\mathbb{U}} \mathrm{d}^2_{\mathbb{U}}(U^*_{n,m},u)\, \gamma_{\hat{T}_{n,m},(X_1,\ldots,X_n)}(\mathrm{d}u)\right] \\ &\quad+ 2\,\mathbb{E}\left[\mathcal{L}^{1/2}_{\Theta,(X_1,\ldots,X_n)}(T,\hat{T}_{n,m})\left(\int_{\mathbb{U}} \mathrm{d}^2_{\mathbb{U}}(U^*_{n,m},u)\, \gamma_{\hat{T}_{n,m},(X_1,\ldots,X_n)}(\mathrm{d}u)\right)^{1/2}\right] \\ &= \inf_{T^\dagger_{n,m}} \mathfrak{R}_{\Theta}[T^\dagger_{n,m}] + \mathbb{E}\left[\inf_{U^\dagger_{n,m}} \int_{\mathbb{U}} \mathrm{d}^2_{\mathbb{U}}(U^\dagger_{n,m},u)\, \gamma_{\hat{T}_{n,m},(X_1,\ldots,X_n)}(\mathrm{d}u)\right] \\ &\quad+ 2\,\mathbb{E}\left[\mathcal{L}^{1/2}_{\Theta,(X_1,\ldots,X_n)}(T,\hat{T}_{n,m})\left(\int_{\mathbb{U}} \mathrm{d}^2_{\mathbb{U}}(U^*_{n,m},u)\, \gamma_{\hat{T}_{n,m},(X_1,\ldots,X_n)}(\mathrm{d}u)\right)^{1/2}\right]. \end{split} \tag{32}$$

As an immediate remark, notice that the last member of (32) is obtained by first optimizing the risk $\mathfrak{R}_\Theta$ with respect to the choice of $T^\dagger_{n,m}$ and then, after getting $\hat{T}_{n,m}$, the term $\mathbb{E}\big[\int_{\mathbb{U}} \mathrm{d}^2_{\mathbb{U}}(U^\dagger_{n,m}, u)\, \gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}(\mathrm{d}u)\big]$ is optimized with respect to the choice of $U^\dagger_{n,m}$. Of course, the convenience of this procedure can be debated—and such a discussion is actually due—even if, in most problems, the strategy proposed in Proposition 2 proves to be the simplest and most feasible one, above all if computational issues are taken into account. In fact, the absolute best theoretical strategy—consisting of optimizing the right-hand side of (31) jointly with respect to the choice of $(U^\dagger_{n,m}, T^\dagger_{n,m})$—very often turns out to be too complex and onerous to carry out. Therefore, it seems reasonable to quantify, at least approximately, how far the strategy of Proposition 2 is from absolute optimality, in terms of efficiency. Finally, the additional term

$$2\,\mathbb{E}\left[\mathcal{L}^{1/2}_{\Theta,(X_{1},\ldots,X_{n})}(T,\hat{T}_{n,m})\left(\int_{\mathbb{U}}\mathrm{d}^{2}_{\mathbb{U}}(U^{*}_{n,m},u)\, \gamma_{\hat{T}_{n,m},(X_{1},\ldots,X_{n})}(\mathrm{d}u)\right)^{1/2}\right] \tag{33}$$

will be reconsidered in the next statement, within the simplified setting of Section 2.2. Indeed, by arguing asymptotically, it will be shown that it is essentially negligible, proving in this way a sort of "Pythagorean inequality".

Henceforth, to make the above remark effective, we will formulate the subsequent results within the simplified setting introduced in Section 2.2. Indeed, **Steps 1–3** mentioned in the Introduction are worthy of being reconsidered in light of Proposition 2. On the one hand, **Steps 1** and **2** boil down to checking the existence and uniqueness of the barycenters appearing in (15) and (13), for instance by using the results contained in [24–26]. On the other hand, **Step 3** hinges on the validity of (2) and (3), which are somewhat related to inequality (32). More precisely, (2) will be proved directly by resorting to identity (29), while (3) will be obtained by estimating the right-hand side of (32). Here is a precise statement.

**Proposition 3.** *Besides Assumption 1 and* (18)*, suppose that* $p > 0$ *and* $p \in C^1(\Theta)$*,* $\mathrm{m}, \mathrm{v} \in C^2(\Theta)$*,* $\Delta^2 \in C^2(\Theta^2)$*, and that* $\kappa_t$ *is any element of* $\mathcal{H} \cap C^3(\Theta)$ *with a unique minimum point at* $t \in \Theta$*. Then, it holds*

$$\begin{split} &\int_{\Theta} \left\{ \mathrm{v}(\theta) + [\mathrm{m}(\theta)]^2 \right\} \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) - \left( \int_{\Theta} \mathrm{m}(\theta)\, \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) \right)^2 \\ &= \mathrm{v}(t) + \frac{1}{n\kappa_t''(t)} \left\{ [\mathrm{m}'(t)]^2 + \frac{1}{2} \mathrm{v}''(t) + \mathrm{v}'(t) \left[ \frac{p'(t)}{p(t)} - \frac{1}{2} \frac{\kappa_t'''(t)}{\kappa_t''(t)} \right] \right\} + o\left(\frac{1}{n}\right) \end{split} \tag{34}$$

$$\begin{split} &\int_{\Theta} \Delta^{2}(\theta,\tau)\, \pi_{n}^{*}(\mathrm{d}\theta \mid -\kappa_{t}) \\ &= \Delta^{2}(t,\tau) + \frac{1}{n\kappa_{t}''(t)} \left\{ \frac{1}{2} \frac{\partial^{2}}{\partial\theta^{2}} \Delta^{2}(\theta,\tau)\Big|_{\theta=t} + \frac{\partial}{\partial\theta} \Delta^{2}(\theta,\tau)\Big|_{\theta=t} \left[ \frac{p'(t)}{p(t)} - \frac{1}{2} \frac{\kappa_{t}'''(t)}{\kappa_{t}''(t)} \right] \right\} + o\left(\frac{1}{n}\right) \end{split} \tag{35}$$

*as n* → +∞*, for any τ* ∈ Θ*.*
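The expansion (35) lends itself to a direct numerical check in a toy instance (all choices below are ours): for the Gaussian model $N(\theta,1)$ one has $\kappa_t(\theta) = \mathrm{K}(t \parallel \theta) = (\theta-t)^2/2$, and if $\gamma_\theta = N(\theta, 1/m)$ then $\Delta(\theta,\tau) = |\theta-\tau|$, so every ingredient of (35) is explicit: $\kappa_t'' \equiv 1$, $\kappa_t''' \equiv 0$, and $p'(t)/p(t) = -t$ for a standard normal prior density $p$.

```python
import numpy as np

# Toy check of (35) (choices ours): Gaussian model, standard normal prior,
# Delta^2(theta, tau) = (theta - tau)**2, kappa_t(theta) = (theta - t)**2 / 2.
t, tau, n = 0.7, 0.4, 2_000

theta = np.linspace(t - 1.0, t + 1.0, 40_001)
dtheta = theta[1] - theta[0]
log_post = -n * (theta - t) ** 2 / 2 - theta ** 2 / 2   # n<h, h_theta> + log p, with h = -kappa_t
w = np.exp(log_post - log_post.max())
w /= w.sum() * dtheta                                   # normalized pi_n*(.| -kappa_t) on the grid

lhs = np.sum((theta - tau) ** 2 * w) * dtheta

# Right-hand side of (35): here d2/dtheta2 Delta^2 = 2 and
# d/dtheta Delta^2 |_{theta=t} = 2*(t - tau), so the curly bracket is 1 + 2*(t - tau)*(-t).
rhs = (t - tau) ** 2 + (1.0 / n) * (1.0 + 2.0 * (t - tau) * (-t))

print(lhs, rhs)   # agree up to o(1/n)
```

In this conjugate case the left-hand side is even available in closed form, and the discrepancy between the two printed values is of order $1/n^2$, as the $o(1/n)$ remainder predicts.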

Here, it is worth noticing that the asymptotic expansions derived in the above proposition are obtained by means of the Laplace method, as first proposed in [39]. See also [40] [Chapter 20]. At this stage, we face the problem of optimizing the left-hand side of (35) with respect to $\tau$. Since the explicit expression of $\Delta^2(t, \tau)$ will hardly be known in closed form, a reasonable strategy considers, for fixed $t \in \Theta$, the optimization of the right-hand side of (35) with respect to $\tau$, disregarding the remainder term $o(1/n)$. If $\Delta^2 \in C^3(\Theta^2)$, this attempt leads to considering the equation

$$\frac{\partial}{\partial \tau} \left[ \Delta^2(t, \tau) + \frac{1}{n \kappa_t''(t)} \left\{ \frac{1}{2} \frac{\partial^2}{\partial \theta^2} \Delta^2(\theta, \tau)\Big|_{\theta = t} + \frac{\partial}{\partial \theta} \Delta^2(\theta, \tau)\Big|_{\theta = t} \left[ \frac{p'(t)}{p(t)} - \frac{1}{2} \frac{\kappa_t'''(t)}{\kappa_t''(t)} \right] \right\} \right] = 0 \tag{36}$$

and, since

$$\frac{\partial}{\partial \tau} \Delta^2 (t, \tau)\Big|_{\tau=t} = 0 \,, \tag{37}$$

we have that any solution of (36) is of the form $\hat{\tau}_n = t + \epsilon_n$, with some $\epsilon_n$ that goes to zero as $n \to +\infty$. For completeness, the validity of (37) could be obtained by using the explicit expression of the Wasserstein distance due to Dall'Aglio. See [41].

If $\Delta^2 \in C^4(\Theta^2)$, we can plug the expression of $\hat{\tau}_n$ into (35), and expand further the right-hand side. Exploiting that

$$\begin{split} \Delta^{2}(t,\hat{\tau}_{n}) &= \frac{1}{2} \frac{\partial^{2}}{\partial\tau^{2}} \Delta^{2}(t,\tau)\Big|_{\tau=t} \cdot \epsilon_{n}^{2} + o(\epsilon_{n}^{2}) \\ \frac{\partial}{\partial t} \Delta^{2}(t,\hat{\tau}_{n}) &= \frac{\partial}{\partial\tau} \left[ \frac{\partial}{\partial t} \Delta^{2}(t,\tau) \right]_{\Big|\tau=t} \cdot \epsilon_{n} + \frac{1}{2} \frac{\partial^{2}}{\partial\tau^{2}} \left[ \frac{\partial}{\partial t} \Delta^{2}(t,\tau) \right]_{\Big|\tau=t} \cdot \epsilon_{n}^{2} + o(\epsilon_{n}^{2}) \\ \frac{\partial^{2}}{\partial t^{2}} \Delta^{2}(t,\hat{\tau}_{n}) &= \frac{\partial^{2}}{\partial t^{2}} \Delta^{2}(t,\tau)\Big|_{\tau=t} + \frac{\partial}{\partial\tau} \left[ \frac{\partial^{2}}{\partial t^{2}} \Delta^{2}(t,\tau) \right]_{\Big|\tau=t} \cdot \epsilon_{n} \\ &\quad+ \frac{1}{2} \frac{\partial^{2}}{\partial\tau^{2}} \left[ \frac{\partial^{2}}{\partial t^{2}} \Delta^{2}(t,\tau) \right]_{\Big|\tau=t} \cdot \epsilon_{n}^{2} + o(\epsilon_{n}^{2}), \end{split}$$

we get

$$\begin{split} &\int_{\Theta} \Delta^{2}(\theta, \hat{\tau}_{n})\, \pi_{n}^{*}(\mathrm{d}\theta \mid -\kappa_{t}) \\ &= \frac{1}{2} \frac{\partial^{2}}{\partial \tau^{2}} \Delta^{2}(t, \tau)\Big|_{\tau=t} \cdot \epsilon_{n}^{2} + \frac{1}{n \kappa_{t}''(t)} \bigg[ \frac{1}{2} \frac{\partial^{2}}{\partial t^{2}} \Delta^{2}(t, \tau)\Big|_{\tau=t} + \frac{1}{2} \frac{\partial}{\partial \tau} \left[ \frac{\partial^{2}}{\partial t^{2}} \Delta^{2}(t, \tau) \right]_{\Big|\tau=t} \cdot \epsilon_{n} \\ &\quad+ \frac{1}{4} \frac{\partial^{2}}{\partial \tau^{2}} \left[ \frac{\partial^{2}}{\partial t^{2}} \Delta^{2}(t, \tau) \right]_{\Big|\tau=t} \cdot \epsilon_{n}^{2} + \left[ \frac{p'(t)}{p(t)} - \frac{1}{2} \frac{\kappa_{t}'''(t)}{\kappa_{t}''(t)} \right] \cdot \frac{\partial}{\partial \tau} \left[ \frac{\partial}{\partial t} \Delta^{2}(t, \tau) \right]_{\Big|\tau=t} \cdot \epsilon_{n} \\ &\quad+ \frac{1}{2} \left[ \frac{p'(t)}{p(t)} - \frac{1}{2} \frac{\kappa_{t}'''(t)}{\kappa_{t}''(t)} \right] \cdot \frac{\partial^{2}}{\partial \tau^{2}} \left[ \frac{\partial}{\partial t} \Delta^{2}(t, \tau) \right]_{\Big|\tau=t} \cdot \epsilon_{n}^{2} \bigg] + o(\epsilon_n^2) + o\left(\frac{1}{n}\right) \cdot \epsilon_n\,. \end{split}$$

The right-hand side of this expression has the form

$$a \cdot \epsilon_n^2 + \frac{1}{n} \left[A \cdot \epsilon_n^2 + B \cdot \epsilon_n + C\right] + o(\epsilon_n^2) + o\left(\frac{1}{n}\right) \cdot \epsilon_n$$

so that the choice

$$\epsilon_{n} = -\frac{B}{2na} \left( 1 + \frac{A}{na} \right)^{-1} + o\left(\frac{1}{n^2}\right) = -\frac{B}{2na} + o\left(\frac{1}{n}\right).$$

optimizes its expression. Whence,

$$\begin{split} \hat{\tau}_{n} &= t - \frac{1}{n\kappa_{t}''(t)} \left( \frac{\partial^{2}}{\partial\tau^{2}} \Delta^{2}(t,\tau)\Big|_{\tau=t} \right)^{-1} \cdot \left\{ \frac{1}{2} \frac{\partial}{\partial\tau} \left[ \frac{\partial^{2}}{\partial t^{2}} \Delta^{2}(t,\tau) \right]_{\Big|\tau=t} \right. \\ &\quad\left. + \left[ \frac{p'(t)}{p(t)} - \frac{1}{2} \frac{\kappa_{t}'''(t)}{\kappa_{t}''(t)} \right] \cdot \frac{\partial}{\partial\tau} \left[ \frac{\partial}{\partial t} \Delta^{2}(t,\tau) \right]_{\Big|\tau=t} \right\} + o\left(\frac{1}{n}\right) \end{split} \tag{39}$$

and consequently

$$\int_{\Theta} \Delta^2(\theta, \hat{\tau}_{n})\, \pi_{n}^*(\mathrm{d}\theta \mid -\kappa_t) = \frac{1}{2n\kappa_t''(t)} \frac{\partial^2}{\partial t^2} \Delta^2(t, \tau)\Big|_{\tau=t} + o\left(\frac{1}{n}\right). \tag{40}$$
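The choice of $\epsilon_n$ above amounts to minimizing the quadratic $a\epsilon^2 + \frac{1}{n}(A\epsilon^2 + B\epsilon + C)$ in $\epsilon$, which can be confirmed symbolically (a minimal sketch with generic coefficients):

```python
import sympy as sp

# Minimize a*e**2 + (A*e**2 + B*e + C)/n in e, with generic coefficients.
a, n = sp.symbols('a n', positive=True)
A, B, C, e = sp.symbols('A B C e', real=True)
expr = a * e ** 2 + (A * e ** 2 + B * e + C) / n
e_star = sp.solve(sp.diff(expr, e), e)[0]   # exact minimizer: -B/(2*(A + a*n))
print(sp.simplify(e_star))

# Leading behaviour: n*e_star -> -B/(2*a), i.e., e_star = -B/(2*n*a) + o(1/n)
print(sp.limit(e_star * n, n, sp.oo))
```

The exact minimizer $-B/(2(A+na))$ coincides with $-B/(2na)$ to first order, which is all that survives in (40).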

A first consequence of these computations is that the (Bayesian) estimator $\hat{T}_{n,m}$ in (15) has the same form as (39), with $t$ and $\kappa_t$ replaced by the MLE, denoted by $\hat{\theta}_n$, and by $-H_n$, respectively. Of course, this fact has some relevance only in the case that $\hat{\theta}_n$ exists and is unique. Moreover, coming back to (32), it is worth noticing that

$$\inf_{U^\dagger_{n,m}} \int_{\mathbb{U}} \left| U^\dagger_{n,m} - u \right|^{2} \gamma_{\hat{\tau}_{n}}(\mathrm{d}u) = \mathrm{v}(\hat{\tau}_{n}) = \mathrm{v}(t) + \mathrm{v}'(t)\,\epsilon_{n} + o\left(\frac{1}{n}\right),\tag{41}$$

where we have dropped the dependence on $(X_1,\dots,X_n)$ in the expression of $\gamma_{\hat{\tau}_n}$, in agreement with the simplified setting of Section 2.2 we are following. The last preliminary remark is about the additional term (33) that appears in the last member of (32). In fact, exploiting from the beginning that $\mathbb{U} = \mathbb{R}$ and $\mathcal{L}_{\mathbb{U}}(u,v) = |u-v|^2$, we find that it reduces to

$$2\,\mathbb{E}\left[\int_{\Theta}\int_{\mathcal{X}^{m}}\left(u_{m}(\mathbf{y};\theta)-u_{m}(\mathbf{y};\hat{T}_{n,m})\right)\left(u_{m}(\mathbf{y};\hat{T}_{n,m})-\mathrm{m}(\hat{T}_{n,m})\right) \mu^{\otimes m}(\mathrm{d}\mathbf{y}\mid\theta)\,\pi_{n}(\mathrm{d}\theta\mid X_{1},\ldots,X_{n})\right]\tag{42}$$

by which we notice that it also involves "covariance terms". The way is now paved to state the following

**Theorem 2.** *Besides Assumption 1 and* (18)*, suppose that* $\mathrm{m}, \mathrm{v} \in C^2(\Theta)$ *and* $\Delta^2 \in C^4(\Theta^2)$*. Then, the identities*

$$\frac{\partial^2}{\partial \tau^2} \Delta^2(t, \tau)\Big|_{\tau=t} = -\frac{\partial}{\partial \tau} \left[ \frac{\partial}{\partial t} \Delta^2(t, \tau) \right]_{\Big|\tau=t} \tag{43}$$

$$\begin{split} \frac{1}{2}\mathrm{v}''(t) + [\mathrm{m}'(t)]^2 &= \frac{1}{2} \frac{\partial^2}{\partial t^2} \Delta^2(t,\tau)\Big|_{\tau=t} \\ &\quad- \frac{1}{2} \left( \frac{\partial^2}{\partial \tau^2} \Delta^2(t,\tau)\Big|_{\tau=t} \right)^{-1} \frac{\partial}{\partial \tau} \left[ \frac{\partial^2}{\partial t^2} \Delta^2(t,\tau) \right]_{\Big|\tau=t} \mathrm{v}'(t) \end{split} \tag{44}$$

*entail that*

$$\begin{split} &\int_{\Theta} \left\{ \mathrm{v}(\theta) + [\mathrm{m}(\theta)]^2 \right\} \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) - \left( \int_{\Theta} \mathrm{m}(\theta)\, \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) \right)^2 \\ &= \int_{\Theta} \Delta^2(\theta, \hat{\tau}_{n})\, \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) + \mathrm{v}(t) + \mathrm{v}'(t)\,\epsilon_{n} + o\left(\frac{1}{n}\right) \end{split} \tag{45}$$

*for any* $t \in \Theta$*, any* $\kappa_t \in \mathcal{H} \cap C^3(\Theta)$ *with a unique minimum point at* $t \in \Theta$*, and any* $p > 0$ *with* $p \in C^1(\Theta)$*, provided that the term in* (42) *is of* $o(\frac{1}{n})$*-type. Thus, if either*

*(A1)* (26) *holds uniformly with respect to some class* F *of continuous functionals* Ψ : H → R*, in the sense that*

$$\sup_{\Psi\in\mathcal{F}}\left|\int_{\mathcal{H}}\Psi(h)\,\nu_{n}(\mathrm{d}h) - \int_{\mathcal{H}}\Psi(h)\,\nu(\mathrm{d}h) - \frac{1}{n}\int_{\mathcal{H}}\Psi(h)\,\Gamma(h;\mu,\pi)\,\nu(\mathrm{d}h)\right| = o\left(\frac{1}{n}\right)$$

*(A2) both the functionals* $h \mapsto \int_\Theta\{\mathrm{v}(\theta)+[\mathrm{m}(\theta)]^2\}\,\pi^*_n(\mathrm{d}\theta \mid h) - \left(\int_\Theta \mathrm{m}(\theta)\,\pi^*_n(\mathrm{d}\theta \mid h)\right)^2$ *and* $h \mapsto \inf_{\mathcal{T}^\dagger_{n,m}} \int_\Theta [\Delta(\theta, \mathcal{T}^\dagger_{n,m}(h))]^2\,\pi^*_n(\mathrm{d}\theta \mid h)$ *belong to* $\mathcal{F}$*, for all* $n \in \mathbb{N}$

*or*


*then* (2)*–*(3) *are in force with*

$$\hat{R}_{0,m} = R_{0,m}^* = \int_{\Theta} \mathrm{v}(t)\, \pi(\mathrm{d}t) \tag{46}$$

$$\begin{split} \hat{R}_{1,m} = R_{1,m}^* &= \int_{\Theta} \frac{1}{\kappa_t''(t)} \left\{ [\mathrm{m}'(t)]^2 + \frac{1}{2} \mathrm{v}''(t) + \mathrm{v}'(t) \left[ \frac{p'(t)}{p(t)} - \frac{1}{2} \frac{\kappa_t'''(t)}{\kappa_t''(t)} \right] \right\} \pi(\mathrm{d}t) \\ &\quad+ \int_{\Theta} \mathrm{v}(t)\, \Gamma(-\kappa_t; \mu, \pi)\, \pi(\mathrm{d}t), \end{split} \tag{47}$$

*where <sup>κ</sup>t*(*θ*) :<sup>=</sup> <sup>K</sup>(*<sup>t</sup>* || *<sup>θ</sup>*)*, for any p* <sup>&</sup>gt; <sup>0</sup> *with p* <sup>∈</sup> <sup>C</sup>1(Θ)*.*

As announced in the Introduction, here we have minted the term *compatibility equations* to refer to identities (43) and (44). They actually constitute two "compatibility conditions" that involve only the statistical model, without any mention of the prior. The dependence on the quantity to be estimated is indeed hidden in the expression of $\Delta^2$. More deeply, these equations can be viewed as a check on the compatibility between the original estimation problem (1) and the fact that we have metrized the space of parameters $\Theta$ as in (20). Actually, they could have a more general value if interpreted as relations aimed at *characterizing* $\Delta^2$, rather than imposing that this distance is given in terms of the Wasserstein distance as in (20). However, for a distance that is characterized differently from (20), an analogue of inequality (32) should be checked in terms of this new distance on $\Theta$. As to the concrete check of the compatibility equations, we notice that the former identity (43) is generally valid as a consequence of the representation formula for the Wasserstein distance due to Dall'Aglio (see [41]), as long as the exchange between derivatives and integrals is allowed. For the other identity (44), we have instead collected in Section 4 some examples of simple statistical models for which its verification proves to be quite simple. Finally, the issue of extending these equations to higher dimensions, including the infinite-dimensional case, is deferred to Section 6.

Apropos of the other assumptions, the verification that the term in (42) is of $o(\frac{1}{n})$-type is generally straightforward. For instance, such a term is even equal to zero if $u_m$ is independent of $\theta$. As to the two groups of assumptions which are needed to prove (46) and (47), the latter block, formed by (B1) and (B2), is certainly easier to check. However, (B1) and (B2) can prove to be rather strong, since they require the existence of the MLE for any $n \in \mathbb{N}$. On the other hand, checking (A1) and (A2) is generally harder, since it constitutes a strong reinforcement of de Finetti's Law of Large Numbers for the log-likelihood process, similar in its conception to those stated in [35,36]. Moreover, the check of (A2) is more or less equivalent to proving a uniform regularity of the mapping $h \mapsto \pi^*_n(\mathrm{d}\theta \mid h)$, as a map from $\mathcal{H}$ into the space of p.m.'s on $(\Theta, \mathcal{T})$ metrized with a Wasserstein distance. This theory is presented and developed in [42,43]. In any case, these lines of research deserve further investigation, to be deferred to a forthcoming paper.

Finally, we consider **Steps 4–6** mentioned in the Introduction, in light of the previous results. In fact, the compatibility Equations (43) and (44) suggest two new compatibility conditions, which are necessary to get (10) along with $\hat{R}_{i,m} = \tilde{R}_{i,m}$ for $i = 0, 1$. A formal statement reads as follows.

**Theorem 3.** *Besides Assumption 1 and* (18)*, suppose that* $\mathrm{m}, \mathrm{v} \in C^2(\Theta)$ *and* $\Delta^2 \in C^4(\Theta^2)$*. Assume also that either (A1) and (A2) or (B1) and (B2) of* Theorem *2 are in force. Then, any solution* $\hat{\tau}_n$ *of the following equations:*

$$\mathrm{v}(\hat{\theta}_n) = \mathrm{v}(\hat{\tau}_n) + \Delta^2(\hat{\tau}_n, \hat{\theta}_n) + o\left(\frac{1}{n}\right) \tag{48}$$

$$\mathrm{v}'(\hat{\theta}_n) = \frac{\partial}{\partial t}\Delta^2(\hat{\tau}_n, t)\Big|_{t=\hat{\theta}_n} + o\left(\frac{1}{n}\right) \tag{49}$$

$$\frac{1}{2}\mathrm{v}''(\hat{\theta}_n) + [\mathrm{m}'(\hat{\theta}_n)]^2 = \frac{1}{2}\frac{\partial^2}{\partial t^2}\Delta^2(\hat{\tau}_n, t)\Big|_{t=\hat{\theta}_n} + o\left(\frac{1}{n}\right), \tag{50}$$

*where* $\hat{\theta}_n$ *stands for the MLE, yields a prior-free estimator* $\hat{T}_{n,m}$ *and, through Step 5, another prior-free estimator* $\tilde{U}_{n,m}$ *that satisfies* (10) *along with* $\hat{R}_{i,m} = \tilde{R}_{i,m}$ *for* $i = 0, 1$*, where* $\hat{R}_{0,m}$ *and* $\hat{R}_{1,m}$ *are as in* (46) *and* (47)*, respectively, provided that the term in* (42) *is of* $o(\frac{1}{n})$*-type.*

The derivation of new prior-free estimators via this procedure represents a novel line of research that we would like to pursue in forthcoming works.

#### **4. Applications and Examples**

This section is split into five subsections, and has two main purposes. Sections 4.1–4.3 contain explicit examples of very simple statistical models for which the compatibility equations are satisfied: the one-dimensional Gaussian, the exponential and the Pareto models. Section 4.4 has a different nature, since it is devoted to a more concrete application of our approach to the original Poisson-mixture setting used by Herbert Robbins to introduce his own approach to empirical Bayes theory. Finally, Section 4.5 carries on the discussion initiated in Section 4.4 by showing a concrete application relative to one year of claims data for an automobile insurance company.

## *4.1. The Gaussian Model*

Here, we have X = Θ = R and

$$\mu(A \mid \theta) = \int_{A} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{1}{2\sigma^2}(x - \theta)^2\Big\}\,\mathrm{d}x \qquad (A \in \mathcal{A}(\mathbb{R})),$$

for some known $\sigma^2 > 0$. For simplicity, we put $m = 1$ and $u_1(y, \theta) = y$, which is tantamount to saying that the original predictive aim was focused on the estimation of $X_{n+1}$. In this setting, it is straightforward to check that $\mathrm{m}(\theta) = \theta$ and $\mathrm{v}(\theta) = \sigma^2$. Moreover, in view of well-known computations on the Wasserstein distance (see [44,45]), it is also straightforward to check that $\Delta^2(\theta, \tau) = |\theta - \tau|^2$. Therefore, (43) becomes $2 = 2$, while (44) reduces to $1 = 1$. Finally, it is also possible to check the validity of (48)–(50) with the simplest choice $\hat{\tau}_n = \hat{\theta}_n$.
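The identity $\Delta^2(\theta, \tau) = |\theta - \tau|^2$ is easy to confirm numerically via Dall'Aglio's quantile representation of $\mathcal{W}_2$. The sketch below is our own illustration (the function name and grid size are arbitrary choices, not part of the paper):

```python
# Numerical check of Delta^2(theta, tau) = |theta - tau|^2 for two Gaussians
# with a common known variance, via Dall'Aglio's quantile representation
#   W_2^2(mu_1, mu_2) = \int_0^1 (F_1^{-1}(u) - F_2^{-1}(u))^2 du.
from statistics import NormalDist

def w2_squared_gaussian(theta, tau, sigma, grid=20_000):
    F = NormalDist(theta, sigma).inv_cdf
    G = NormalDist(tau, sigma).inv_cdf
    # midpoint rule on (0, 1), avoiding the endpoints where inv_cdf diverges
    return sum((F((i + 0.5) / grid) - G((i + 0.5) / grid)) ** 2
               for i in range(grid)) / grid

print(w2_squared_gaussian(1.3, -0.4, 2.0))  # ≈ |1.3 - (-0.4)|^2 = 2.89
```

Since the two quantile functions differ by the constant $\theta - \tau$, the quadrature is exact up to floating-point error.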

The case of constant mean and unknown variance will not be dealt with here, because its treatment is substantially included in the following subsection. Apropos of the multidimensional variant of this model, very important in many statistical applications, we just mention the interesting paper [46], which paves the way, mathematically speaking, to writing down the multidimensional analogue of the compatibility equations in a full Riemannian context.

## *4.2. The Exponential Model*

Here, we have X = Θ = (0, ∞) and

$$\mu(A \mid \theta) = \int_A \theta e^{-\theta x}\,\mathrm{d}x \qquad (A \in \mathcal{A}(0, +\infty))\,.$$

Again, for simplicity, we put $m = 1$ and $u_1(y, \theta) = y$, which is tantamount to saying that the original predictive aim was focused on the estimation of $X_{n+1}$. In this setting, it is straightforward to check that $\mathrm{m}(\theta) = 1/\theta$ and $\mathrm{v}(\theta) = 1/\theta^2$. Moreover, by resorting to Dall'Aglio's representation of the Wasserstein distance (see [41]), it is also straightforward to check that $\Delta^2(\theta, \tau) = 2|1/\theta - 1/\tau|^2$. Although very simple, this is an interesting example of a non-Euclidean distance on $\Theta = (0, \infty)$. As to the validity of the compatibility equations, we easily see that (43) yields $4/t^4 = 4/t^4$, while (44) becomes:

$$\frac{3}{t^4} + \left(\frac{1}{t^2}\right)^2 - \frac{1}{2}\cdot\frac{4}{t^4} - \frac{1}{2}\left(\frac{8}{t^5}\right)\cdot\left(\frac{4}{t^4}\right)^{-1}\cdot\left(-\frac{2}{t^3}\right)\dots$$
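The value $\Delta^2(\theta, \tau) = 2|1/\theta - 1/\tau|^2$ quoted above can also be reproduced numerically through Dall'Aglio's representation, integrating the squared difference of exponential quantile functions. This is our own sketch (function name and grid size are arbitrary):

```python
import math

def w2_squared_exponential(theta, tau, grid=200_000):
    # Dall'Aglio's representation with exponential quantiles
    # F^{-1}(u) = -log(1 - u) / theta
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) / grid          # midpoint rule on (0, 1)
        q = -math.log1p(-u)           # standard exponential quantile
        total += (q / theta - q / tau) ** 2
    return total / grid

print(w2_squared_exponential(0.5, 2.0))  # ≈ 2 * (1/0.5 - 1/2.0)^2 = 4.5
```

The factor 2 comes from $\int_0^1 \log^2(1-u)\,\mathrm{d}u = 2$, so the quadrature converges to $2|1/\theta - 1/\tau|^2$.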

## *4.3. The Pareto Model*

Here, we have X = Θ = (0, ∞) and

$$\mu(A \mid \theta) = \int_{A \cap (\theta, +\infty)} \frac{\alpha\theta^{\alpha}}{x^{\alpha+1}}\,\mathrm{d}x \qquad (A \in \mathcal{A}(0, +\infty))$$

for some known $\alpha > 2$. Again, for simplicity, we put $m = 1$ and $u_1(y, \theta) = y$, which is tantamount to saying that the original predictive aim was focused on the estimation of $X_{n+1}$. In this setting, it is straightforward to check that $\mathrm{m}(\theta) = \frac{\alpha}{\alpha-1}\theta$ and $\mathrm{v}(\theta) = \frac{\alpha}{(\alpha-2)(\alpha-1)^2}\theta^2$. Moreover, by resorting to Dall'Aglio's representation of the Wasserstein distance (see [41]), it is also straightforward to check that $\Delta^2(\theta, \tau) = \frac{\alpha}{\alpha-2}|\theta - \tau|^2$. Of course, this is not a regular model, since the support of $\mu(\cdot \mid \theta)$ varies with $\theta$. Anyway, it is interesting to notice that the compatibility equations are still valid in this case. Therefore, the analysis of such non-regular models should motivate further investigations about their intrinsic value.
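Here too, $\Delta^2(\theta, \tau) = \frac{\alpha}{\alpha-2}|\theta - \tau|^2$ can be confirmed numerically from the Pareto quantile function $F^{-1}(u) = \theta(1-u)^{-1/\alpha}$, in a sketch of our own (names and grid size are arbitrary):

```python
def w2_squared_pareto(theta, tau, alpha, grid=200_000):
    # Dall'Aglio's representation; the quantile of a Pareto law with scale
    # theta and shape alpha is theta * (1 - u)^(-1/alpha), so the quantile
    # difference factorizes as (theta - tau) * (1 - u)^(-1/alpha)
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) / grid
        q = (1.0 - u) ** (-1.0 / alpha)
        total += ((theta - tau) * q) ** 2
    return total / grid

print(w2_squared_pareto(2.0, 3.0, 4.0))  # ≈ alpha/(alpha-2) * |theta-tau|^2 = 2.0
```

The constant $\frac{\alpha}{\alpha-2}$ is exactly $\int_0^1 (1-u)^{-2/\alpha}\,\mathrm{d}u$, which is finite precisely when $\alpha > 2$.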

## *4.4. Robbins' Approach to Empirical Bayes*

In his seminal paper [23], Herbert Robbins introduced the following model to present his own approach to empirical Bayes theory. The problem that he considers is inspired by car insurance data analysis, and it is only slightly different from a "standard" predictive problem. We start by putting $\mathcal{X} = \mathbb{N}_0^2$ and $\mathrm{U} = \mathbb{N}_0$, and considering exchangeable random variables $X_i$'s with $X_i = (\xi_i, \eta_i)$. The practical meaning is that $\xi_i$ represents the number of accidents experienced by the $i$-th customer in the past year, while $\eta_i$ represents the number of accidents that the same $i$-th customer will experience in the current year. Then, Robbins (in his own notation) attaches to the $i$-th customer a random parameter, say $\lambda_i > 0$, which represents the rate of a Poisson distribution for that customer. Moreover, he considers the $\lambda_i$'s as i.i.d. and, conditionally on the $\lambda_i$'s, the $X_i$'s become independent; in addition, $\xi_i$ and $\eta_i$ become i.i.d. with distribution Poi($\lambda_i$), for all $i \in \mathbb{N}$. Robbins calls $G$ the common distribution of the $\lambda_i$'s and interprets it as a "prior distribution". However, if we strictly follow the Bayesian main way, we should call this distribution $\theta$ to avoid confusion, and just realize that we have, this way, defined the statistical model, that is

$$\mu(\{(k,h)\}\mid\theta) = \int\_0^{+\infty} \frac{e^{-z}z^k}{k!} \frac{e^{-z}z^h}{h!} \theta(\mathrm{d}z) \qquad ((k,h)\in\mathbb{N}\_0^2) \,. \tag{51}$$

Thus, the actual prior (Bayesianly speaking) is some p.m. $\pi$ on the space of all p.m.'s on $((0, +\infty), \mathcal{B}(0, +\infty))$, while the random parameter $T$ considered in the present paper is some *random probability measure*. Here, the objective—actually very practical and intuitively logical—is to estimate $\eta_1$ on the basis of the sample $(\xi_1, \dots, \xi_n)$. Thus, our $U_{n,m}$ coincides with $\eta_1$ and the loss function is just, as usual, the quadratic loss. Throughout his paper, Robbins works under the conditioning to $T = \theta$ (that is, under a fixed prior, in his own terminology). Hence, his "theoretical estimator" reads

$$\mathbb{E}\_{\theta}[\eta\_1 \mid (\xi\_1, \dots, \xi\_n)] = \mathbb{E}\_{\theta}[\eta\_1 \mid \xi\_1] = (\xi\_1 + 1) \frac{p\_{\theta}(\xi\_1 + 1)}{p\_{\theta}(\xi\_1)},\tag{52}$$

where $p_\theta(k) := \mu(\{k\} \times \mathbb{N}_0 \mid \theta)$. To get rid of the unobservable $\theta$, Robbins exploits that $\theta = \mathbb{E}_\theta[\xi_1] = \sum_{k=0}^{+\infty} k\,p_\theta(k)$ to bring the Strong Law of Large Numbers into the game. Indeed, since

$$\hat{p}(k) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\xi_i = k\} \stackrel{\mathbb{P}_\theta\text{-a.s.}}{\longrightarrow} p_\theta(k)$$

holds for any *θ*, then it could be worth considering the (prior-free) estimator:

$$
\tilde{U}_{n,m} = (\xi_1 + 1)\,\frac{\hat{p}(\xi_1 + 1)}{\hat{p}(\xi_1)}\,. \tag{53}
$$
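The prior-free estimator (53) is trivial to compute from raw counts. The following is our own minimal sketch (the helper name and the toy sample are illustrative, not from the paper):

```python
from collections import Counter

def robbins_estimate(xi_sample, k):
    # prior-free Robbins estimator (k + 1) * phat(k + 1) / phat(k), where
    # phat is the empirical pmf of the observed claim counts xi_1, ..., xi_n
    n = len(xi_sample)
    counts = Counter(xi_sample)   # missing keys count as zero
    if counts[k] == 0:
        raise ValueError("k was never observed: the plug-in ratio is undefined")
    return (k + 1) * (counts[k + 1] / n) / (counts[k] / n)

# toy sample: 6 customers with 0 accidents, 3 with 1 accident, 1 with 2
print(robbins_estimate([0] * 6 + [1] * 3 + [2], 0))  # (0+1) * (3/10)/(6/10) = 0.5
```

Note that the sample size $n$ cancels in the ratio, so only the raw counts matter.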

At this stage, if we want to maintain the Bayesian main way, we should make three basic considerations. First, given the statistical model (51), independently of the estimation problem, the assumption of exchangeability of the $X_i$'s entails the existence of some prior distribution $\pi$, by de Finetti's representation theorem. Second, given the quadratic loss function on $\mathrm{U}$, the best (i.e., the most efficient) estimator is given by:

$$\hat{U}_{n,m} := \mathbb{E}[\eta_1 \mid (\xi_1, \dots, \xi_n)]\,,$$

where the expectation $\mathbb{E}$ depends of course on the prior $\pi$. Third, if we consider the above estimator as useless, because of an effective ignorance about the prior $\pi$, we are justified to consider the above $\tilde{U}_{n,m}$ as a possible approximation of $\hat{U}_{n,m}$, in the sense expressed by the joint validity of (2) and (10), with $\hat{R}_{i,m} = \tilde{R}_{i,m}$ for $i = 0, 1$, uniformly with respect to a whole (possibly very large) class of priors $\pi$. Unfortunately, this is not the case. Or rather, we could actually achieve this goal only in the presence of distinguished choices of $\pi$. Therefore, if there is ignorance on $\pi$, we can only consider the Robbins estimator as efficient "at zero-level", and not also "at $O(\frac{1}{n})$-level". If we follow the approach presented in this paper, the natural choice for an estimator is given by:

$$U^{*}_{n,m} = (\xi_1 + 1)\,\frac{\int_0^{+\infty} \frac{e^{-z} z^{\xi_1 + 1}}{(\xi_1 + 1)!}\,\hat{T}_{n,m}(\mathrm{d}z)}{\int_0^{+\infty} \frac{e^{-z} z^{\xi_1}}{\xi_1!}\,\hat{T}_{n,m}(\mathrm{d}z)}, \tag{54}$$

where the estimator $\hat{T}_{n,m}$ belongs to the effective space of the parameters $\Theta$, that is the space of all p.m.'s on $((0, +\infty), \mathcal{B}(0, +\infty))$, and is identified as:

$$\hat{T}_{n,m} = \operatorname{Argmin}_{\tau} \int_{\Theta} \mathcal{W}_2^2(\mu_{\xi_1,\theta},\, \mu_{\xi_1,\tau})\,\pi_n(\mathrm{d}\theta \mid \xi_1, \dots, \xi_n), \tag{55}$$

with

$$\mu_{k,\theta}(A) := \frac{\int_A \frac{e^{-z} z^k}{k!}\,\theta(\mathrm{d}z)}{\int_0^{+\infty} \frac{e^{-z} z^k}{k!}\,\theta(\mathrm{d}z)} \qquad (A \in \mathcal{A}(0, +\infty))\,.$$

The proof of the fact that our estimator is more efficient than the Robbins estimator—at least asymptotically and uniformly with respect to a whole class of priors—will be given in a forthcoming paper. Indeed, such a proof will constitute only a first step towards a complete vindication of our approach. The crowning achievement of the project would be represented by the production of some prior-free approximation of $\hat{T}_{n,m}$ that could lead, through (54), to an efficient estimator $\tilde{U}_{n,m}$ up to the "$O(\frac{1}{n})$-level". Research on this is ongoing.

## *4.5. An Example of Real Data Analysis*

This subsection represents a continuation of the analysis of Robbins' approach to empirical Bayes theory, hinting at some concrete applications. We display below Table 1, taken from [47], which is relative to one year of claims data for a European automobile insurance company. The original source of the data is the actuarial work [48].

**Table 1.** The second line reports the exact counts of claimed accidents; the third and fourth lines display estimated numbers of accidents.


Here, a population of 9461 automobile insurance policy holders is considered. Out of these, 7840 made no claims during the year; 1317 made a single claim; 239 made two claims each, and so forth, continuing to the one person who made seven claims. The insurance company is concerned about the claims each policy holder will make in the next year. The third and fourth lines provide estimates of such numbers by following the original Robbins method (based on (53)) and another compound model discussed in Section 6.1 of [47], respectively. In particular, the Robbins estimator predicts that the 7840 policy holders that made no claims during the year will contribute an amount of 7840 × 0.168 ≈ 1317 accidents, and so on. Analogously, the compound model predicts that the same 7840 policy holders will contribute an amount of 7840 × 0.164 ≈ 1286 accidents, and so on. Moreover, it is worth noticing that the original Robbins estimator suffers from the lack of certain regularity properties, such as monotonicity, so that various smoothed versions of it have been provided by other authors. See [49]. See also [50] [Chapter 5] for a comprehensive treatment.
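The figure 0.168 quoted above can be reproduced directly from the claim counts stated in the text (only the counts for zero, one and two claims, which are the ones given explicitly, are used in this sketch of ours):

```python
# claim counts quoted in the text: 7840 holders with no claims,
# 1317 with one claim, 239 with two claims
counts = {0: 7840, 1: 1317, 2: 239}

def robbins(k):
    # (k + 1) * count(k + 1) / count(k); the sample size cancels in the ratio
    return (k + 1) * counts[k + 1] / counts[k]

print(round(robbins(0), 3))  # 0.168, whence 7840 * 0.168 ≈ 1317 predicted accidents
print(round(robbins(1), 3))  # 0.363
```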

Here, we seize the opportunity to give the reader a taste of our approach, as explained in Section 4.4. A detailed treatment would prove, in any case, too complex to be thoroughly developed in this paper, due to the significant amount of numerical techniques which are necessary to carry out our strategy. Indeed, the big issue is concerned with the implementation of the infinite-dimensional minimization problem (55), which is still under investigation. However, we can simplify the treatment by restricting attention to prior distributions $\pi$ that put the total unitary mass, for example, on the set $\mathcal{E}$ of exponential distributions, so that $\theta(\mathrm{d}z) = \beta e^{-\beta z}\mathrm{d}z$ for $z > 0$ and some hyper-parameter $\beta > 0$. Thus, given some hyper-prior $\zeta$ on the hyper-parameter $\beta$, we can easily see that (55) boils down to a simple, one-dimensional minimization problem. Its solution $\hat{T}_{n,m}$ is provided by the distribution

$$
\hat{\beta}_n e^{-\hat{\beta}_n z}\,\mathrm{d}z
$$

with $\hat{\beta}_n$ coinciding with the harmonic mean of the posterior distribution of the hyper-parameter $\beta$. On the basis of the theory developed in the paper, this solution will prove asymptotically nearly optimal, uniformly with respect to the (narrow) class of prior distributions that put the total unitary mass on $\mathcal{E}$. Whence, the estimator $U^{*}_{n,m}$ in (54) assumes the form

$$
U^{*}_{n,m} = \frac{\xi_1 + 1}{\hat{\beta}_n + 1}\,.
$$

This last estimator is, of course, not prior-free, because $\hat{\beta}_n$ depends on the prior $\zeta$. However, to get a quick result, we can approximate $\hat{\beta}_n$ by means of the Laplace method, again yielding

$$\frac{\xi_1 + 1}{\hat{\beta}_n + 1} \approx (\xi_1 + 1)\,\frac{S_n}{S_n + n} =: \tilde{U}_{n,m}\,,$$

where $S_n$ represents the total number of accidents. Since $n = 9461$ and $S_n = 2028$ in the dataset under consideration, we provide the following new Table 2,

**Table 2.** The second line reports the exact counts of claimed accidents; the third line displays estimated numbers of accidents.


which is indeed comparable with the previous one. To give an idea, the Robbins estimator predicts 2019 total accidents for the next year, while the estimator $\tilde{U}_{n,m}$ above predicts 2033 total accidents for the next year.
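For the reader's convenience, the approximated estimator $(\xi_1 + 1)\,S_n/(S_n + n)$ can be evaluated on the dataset figures quoted above, $n = 9461$ and $S_n = 2028$ (a sketch of ours, with arbitrary function names):

```python
n, S_n = 9461, 2028

def u_tilde(k):
    # Laplace-approximated estimator (xi_1 + 1) * S_n / (S_n + n) from the text
    return (k + 1) * S_n / (S_n + n)

print(round(u_tilde(0), 3))  # per-holder prediction for a customer with no claims
print(round(u_tilde(1), 3))  # per-holder prediction for a customer with one claim
```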

In any case, a thorough analysis of this specific example deserves more attention, and will be developed in a forthcoming new paper.

#### **5. Proofs**

Gathered here are the proofs of the results stated in the main body of the paper.

## *5.1. Theorem 1*

First, by following the same line of reasoning as in [32], conclude that the sequence $\{H_n\}_{n\geq 1}$ is a Cauchy sequence in $L^2(\Omega; \mathcal{H}) := \{W : \Omega \to \mathcal{H} \mid \mathbb{E}[\|W\|^2_{\mathcal{H}}] < +\infty\}$. Thus, by completeness, there exists a random element $H^*$ in $L^2(\Omega; \mathcal{H})$ such that $H_n \stackrel{L^2}{\longrightarrow} H^*$. Now, exploit the continuous embedding $\mathcal{H} \subset C^0(\Theta)$. By de Finetti's Strong Law of Large Numbers (see [9]), $H_n(\theta)$ converges $\mathbb{P}$-a.s. to $-\mathsf{K}(T \,\|\, \theta) + \int_{\mathcal{X}} \log f(z \mid T)\,\mu(\mathrm{d}z \mid T) = H(\theta)$, for any fixed $\theta \in \Theta$. Since $H \in \mathcal{H}$ by Assumption 1, then $H = H^*$ as elements of $\mathcal{H}$. At this stage, the conclusion that $\nu_n \Rightarrow \nu$ follows by the standard implication that $L^2$-convergence implies convergence in distribution, which is still true for random elements taking values in a separable Hilbert space. See [51].

## *5.2. Proposition 1*

Start by considering a functional $\Psi$ in $C^2_b(\mathcal{H})$. Notice that

$$\int\_{\mathcal{H}} \Psi(h)\nu\_n(\mathbf{d}h) = \mathbb{E}[\Psi(H\_n)]$$

and then expand the term Ψ(*Hn* − *H* + *H*) by the Taylor formula (see [33,34]) to get

$$\Psi(H_n) = \Psi(H) + \langle \nabla\Psi(H),\, H_n - H\rangle + \frac{1}{2}\langle \mathrm{Hess}[\Psi]_H (H_n - H),\, H_n - H\rangle + o\big(\|H_n - H\|^2\big).$$

Observe that *H* is *σ*(*T*)-measurable while, by de Finetti's representation theorem, the distribution of *Hn* − *H*, given *T*, coincides with the distribution of a sum of *n* i.i.d. random elements. Whence, the tower property of the conditional expectation entails:

$$\mathbb{E}[\Psi(H_n)] = \mathbb{E}[\Psi(H)] + \frac{1}{2n}\mathbb{E}\big[\mathrm{Hess}[\Psi]_H \otimes \mathrm{Cov}_T[\log f(X_i \mid \cdot\,)]\big] + o\left(\frac{1}{n}\right)$$

since $\mathbb{E}[H_n - H \mid T] = 0$ and, then,

$$\mathbb{E}[\langle \nabla\Psi(H),\, H_n - H\rangle \mid T] = \langle \nabla\Psi(H),\, \mathbb{E}[H_n - H \mid T]\rangle = 0\,,$$

the expression $\mathbb{E}[H_n - H \mid T]$ being intended as a Bochner integral. Thus, the main identity (26) follows immediately from (25), for any $\Psi \in C^2_b(\mathcal{H})$. Once (26) is established for regular $\Psi$'s, one can extend its validity to more general continuous $\Psi$'s by standard approximation arguments.

## *5.3. Lemma 1*

First, observe that:

$$-\mathsf{K}(T \,\|\, \theta) + \int_{\mathcal{X}} \log f(z \mid T)\,\mu(\mathrm{d}z \mid T) = \theta M'(T) - M(\theta)\,.$$

Notice also that:

$$\int\_{\mathcal{H}} \Psi(h) \nu\_n(\mathbf{d}h) = \mathbb{E}\left[ \Psi \left( \theta \mapsto \frac{\theta}{n} \sum\_{i=1}^n S(X\_i) - M(\theta) \right) \right].$$

Then, repeat the same arguments as in the previous proof, getting

$$\begin{split} \int\_{\mathcal{H}} \Psi(h) \boldsymbol{\nu}\_{n}(\mathbf{d}h) &= \int\_{\mathcal{H}} \Psi(h) \boldsymbol{\nu}(\mathbf{d}h) \\ &+ \frac{1}{2n} \int\_{\Theta} \left[ \frac{\mathbf{d}^{2}}{\mathbf{d}x^{2}} \Psi(\theta \mapsto \mathbf{x}\theta - M(\theta)) \right]\_{\Big|\mathbf{x} = M'(t)} \boldsymbol{M}''(t) p(t) \mathbf{d}t + o\left(\frac{1}{n}\right) . \end{split}$$

For standard exponential families, the function $M'$ is one-to-one, with inverse function $V$. Whence, by indicating the range of $M'$ as $\mathrm{Cod}(M')$,

$$\begin{split} & \int_{\Theta} \left[\frac{\mathrm{d}^2}{\mathrm{d}x^2}\Psi(\theta \mapsto x\theta - M(\theta))\right]_{\Big|x = M'(t)} M''(t)\,p(t)\,\mathrm{d}t \\ & \qquad = \int_{\mathrm{Cod}(M')} \left[\frac{\mathrm{d}^2}{\mathrm{d}x^2}\Psi(\theta \mapsto x\theta - M(\theta))\right] M''(V(x))\,p(V(x))\,V'(x)\,\mathrm{d}x \\ & \qquad = \int_{\mathrm{Cod}(M')} \Psi(\theta \mapsto x\theta - M(\theta)) \left[\frac{\mathrm{d}^2}{\mathrm{d}x^2}\big[M''(V(x))\,p(V(x))\,V'(x)\big]\right] \mathrm{d}x \end{split}$$

where, for the last identity, a double integration by parts has been used. Finally, changing the variable according to $x = M'(t)$ leads to the desired result.

## *5.4. Proposition 2*

A disintegration argument shows that

$$\begin{split} \mathfrak{R}_{\mathrm{U}}[U^{\dagger}_{n,m}] &= \int_{\mathcal{X}^{n}} \int_{\Theta \times \mathcal{X}^{m}} \mathcal{L}_{\mathrm{U}}\big(u_{n,m}(\mathbf{y};\mathbf{x};\theta),\, u^{\dagger}_{n,m}(\mathbf{x})\big) \times \\ &\quad \times \mathbb{P}[(X_{n+1}, \dots, X_{n+m}) \in \mathrm{d}\mathbf{y},\, T \in \mathrm{d}\theta \mid (X_1, \dots, X_n) = \mathbf{x}]\, a_n(\mathrm{d}\mathbf{x}) \\ &= \int_{\mathcal{X}^{n}} \int_{\Theta \times \mathcal{X}^{m}} \mathcal{L}_{\mathrm{U}}\big(u_{n,m}(\mathbf{y};\mathbf{x};\theta),\, u^{\dagger}_{n,m}(\mathbf{x})\big)\, \mu^{\otimes m}(\mathrm{d}\mathbf{y} \mid \theta)\, \pi_n(\mathrm{d}\theta \mid \mathbf{x})\, a_n(\mathrm{d}\mathbf{x}) \\ &= \int_{\mathcal{X}^{n}} \int_{\Theta} \mathcal{W}^2_{\mathrm{U}}\big(\gamma_{\theta,\mathbf{x}};\, \delta_{u^{\dagger}_{n,m}(\mathbf{x})}\big)\, \pi_n(\mathrm{d}\theta \mid \mathbf{x})\, a_n(\mathrm{d}\mathbf{x})\,. \end{split}$$

Then, use the triangle inequality for the Wasserstein distance to obtain:

$$\mathcal{W}_{\mathrm{U}}\big(\gamma_{\theta,\mathbf{x}};\, \delta_{u^{\dagger}_{n,m}(\mathbf{x})}\big) \leq \mathcal{W}_{\mathrm{U}}\big(\gamma_{\theta,\mathbf{x}};\, \gamma_{\tau,\mathbf{x}}\big) + \mathcal{W}_{\mathrm{U}}\big(\gamma_{\tau,\mathbf{x}};\, \delta_{u^{\dagger}_{n,m}(\mathbf{x})}\big),$$

for any $\tau \in \Theta$. Take the square of both sides and observe that:

$$\mathcal{W}^2_{\mathrm{U}}\big(\gamma_{\tau,\mathbf{x}};\, \delta_{u^{\dagger}_{n,m}(\mathbf{x})}\big) = \int_{\mathrm{U}} \mathrm{d}^2_{\mathrm{U}}\big(u,\, u^{\dagger}_{n,m}(\mathbf{x})\big)\, \gamma_{\tau,\mathbf{x}}(\mathrm{d}u)\,.$$

Now, (31) is proved by letting $\tau = T^{\dagger}_{n,m}$, after noticing that the latter summand in the above right-hand side is independent of $\theta$.

Finally, (32) is obtained by first optimizing the risk $\mathfrak{R}_{\Theta}$ with respect to the choice of $T^{\dagger}_{n,m}$ and then, after getting $\hat{T}_{n,m}$, the term $\mathbb{E}\big[\int_{\mathrm{U}} \mathrm{d}^2_{\mathrm{U}}(U^{\dagger}_{n,m}, u)\,\gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}(\mathrm{d}u)\big]$ is optimized with respect to the choice of $U^{\dagger}_{n,m}$.

## *5.5. Proposition 3*

Preliminarily, use Theorem 1 in [52] [Section II.1] to prove that:

$$\int\_{\Theta} \varphi(\theta) \varepsilon^{-\kappa\_l(\theta)} \mathbf{d}\theta = \frac{2\sqrt{\pi}}{\sqrt{n}} \varepsilon^{-\kappa\_l(t)} \left[ \mathfrak{c}\_0 + \frac{\mathfrak{c}\_2}{2n} + o\left(\frac{1}{n}\right) \right],$$

holds for any $\varphi \in C^2(\Theta)$ such that $\varphi(t) \neq 0$, where

$$\begin{aligned} c\_0 &:= \frac{b\_0}{2a\_0^{1/2}}\\ c\_2 &:= \left\{ \frac{b\_2}{2} - \frac{3a\_1b\_1}{a\_0} + [5a\_1^2 - 4a\_0a\_2] \frac{3b\_0}{16a\_0^2} \right\} \times \frac{1}{a\_0^{3/2}}, \end{aligned}$$

with $a_0 := \frac{1}{2}\kappa_t''(t)$, $a_1 := \frac{1}{3!}\kappa_t'''(t)$, $a_2 := \frac{1}{4!}\kappa_t''''(t)$, $b_0 = \varphi(t)$, $b_1 = \varphi'(t)$ and $b_2 = \frac{1}{2}\varphi''(t)$. Moreover, from that very same theorem, it holds that:

$$\int_{\Theta} \varphi(\theta)\, e^{-n\kappa_t(\theta)}\, \mathrm{d}\theta = \sqrt{\pi}\, e^{-n\kappa_t(t)} \left[\frac{c_1}{n^{3/2}} + o\left(\frac{1}{n^{3/2}}\right)\right]$$

for any $\varphi \in C^2(\Theta)$ with a zero of order 1 at $t$, where

$$c\_1 := \left[\frac{b\_1^\*}{2} - \frac{a\_1 b\_0^\*}{2a\_0}\right] \frac{1}{a\_0}.$$

with $b_0^* := \varphi'(t)$ and $b_1^* := \frac{1}{2}\varphi''(t)$. At this stage, application of these formulas gives:

$$\int_{\Theta} \mathrm{m}(\theta)\, \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) = \mathrm{m}(t) + \frac{1}{na_0}\left[\frac{1}{4}\mathrm{m}''(t) + \frac{1}{2}\mathrm{m}'(t)\frac{p'(t)}{p(t)} - \frac{3}{4}\frac{a_1}{a_0}\mathrm{m}'(t)\right] + o\left(\frac{1}{n}\right)$$

and

$$\begin{split} \int_{\Theta} \mathrm{m}^2(\theta)\, \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) &= \mathrm{m}^2(t) + \frac{1}{na_0}\Big[\frac{1}{2}(\mathrm{m}'(t))^2 + \frac{1}{2}\mathrm{m}''(t)\mathrm{m}(t) \\ &\quad + \mathrm{m}'(t)\mathrm{m}(t)\frac{p'(t)}{p(t)} - \frac{3}{2}\frac{a_1}{a_0}\mathrm{m}'(t)\mathrm{m}(t)\Big] + o\left(\frac{1}{n}\right). \end{split}$$

Then, in addition,

$$\int_{\Theta} [\mathrm{v}(\theta) - \mathrm{v}(t)]\, \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) = \frac{1}{na_0}\left[\frac{1}{4}\mathrm{v}''(t) + \frac{1}{2}\mathrm{v}'(t)\frac{p'(t)}{p(t)} - \frac{3}{4}\frac{a_1}{a_0}\mathrm{v}'(t)\right] + o\left(\frac{1}{n}\right)$$

and

$$\begin{split} \int_{\Theta} \Delta^2(\theta,\tau)\, \pi_n^*(\mathrm{d}\theta \mid -\kappa_t) &= \Delta^2(t,\tau) \\ &\quad + \frac{1}{na_0}\Big[\frac{1}{2}\frac{\partial}{\partial\theta}\Delta^2(\theta,\tau)\Big|_{\theta=t}\frac{p'(t)}{p(t)} \\ &\quad + \frac{1}{2}\frac{\partial^2}{\partial\theta^2}\Delta^2(\theta,\tau)\Big|_{\theta=t} - \frac{3}{4}\frac{a_1}{a_0}\frac{\partial}{\partial\theta}\Delta^2(\theta,\tau)\Big|_{\theta=t}\Big] + o\left(\frac{1}{n}\right), \end{split}$$

completing the proof by mere substitutions.
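The first of the Laplace-type expansions used in this proof can be sanity-checked numerically. In the toy example below (our own, not from the paper) we take $\kappa_t(\theta) = \theta^2/2$, so that $t = 0$, $a_0 = 1/2$ and $a_1 = a_2 = 0$, together with $\varphi(\theta) = 1 + \theta^2$, so that $b_0 = 1$, $b_1 = 0$ and $b_2 = 1$; in this Gaussian case the expansion is actually exact:

```python
import math

def lhs(n, grid=200_000, half_width=30.0):
    # midpoint rule for \int phi(theta) exp(-n kappa(theta)) dtheta on [-30, 30]
    h = 2 * half_width / grid
    total = 0.0
    for i in range(grid):
        theta = -half_width + (i + 0.5) * h
        total += (1 + theta ** 2) * math.exp(-n * theta ** 2 / 2)
    return total * h

def rhs(n):
    # the expansion (2 sqrt(pi) / sqrt(n)) * exp(-n kappa(t)) * [c0 + c2/(2n)]
    a0, b0, b2 = 0.5, 1.0, 1.0
    c0 = b0 / (2 * math.sqrt(a0))
    c2 = (b2 / 2) / a0 ** 1.5      # the a1 and a2 contributions vanish here
    return (2 * math.sqrt(math.pi) / math.sqrt(n)) * (c0 + c2 / (2 * n))

print(lhs(50), rhs(50))  # both ≈ sqrt(2*pi/50) * (1 + 1/50) ≈ 0.3616
```

Both sides coincide with $\sqrt{2\pi/n}\,(1 + 1/n)$, which is the exact value of the Gaussian integral.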

## *5.6. Theorem 2*

The core of the proof hinges on the identity (45). Now, the asymptotic expansion of its left-hand side is provided by (34), while the analogous expansion for the right-hand side follows from a combination of (40) with (41). It is now straightforward to notice that the validity of (43) and (44) entails (45). At this stage, the validity of (46) and (47) for $\hat{R}_{0,m}$ and $\hat{R}_{1,m}$ follows directly by substitution. As to the same identities for $R^*_{0,m}$ and $R^*_{1,m}$, the argument rests on the combination of (3) with (32), exploiting the fact that the additional term (33) is of $o(1/n)$-type. Thus, the asymptotic expansion of the left-hand side of (3) is given in terms of integrals with respect to $\nu$ of the sum of the two left-hand sides of (40) and (41), respectively. Resorting once again to (45), one gets the desired identities for $R^*_{0,m}$ and $R^*_{1,m}$ by substitution.

## *5.7. Theorem 3*

The core is the proof of (10), with the same expressions (46) and (47) also holding for $\tilde{R}_{0,m}$ and $\tilde{R}_{1,m}$, respectively. As in the proof of Theorem 2, the left-hand side of (10) is analyzed by resorting to inequality (32), exploiting the fact that the ensuing additional term, similar to that in (33), is of $o(1/n)$-type. Now, the argument is very similar to that of the preceding proof, with the variant that the expansion (35) is not optimized in $\tau$, but is just evaluated at $\tau = \hat{\tau}_n$. The conclusion reduces once again to a matter of substituting the expressions (48)–(50) into the two expansions (35) and (41).

#### **6. Conclusions and Future Developments**

This paper should be seen as a pioneering work in the field of predictive problems, whose main aim is to show how the practical construction of efficient estimators of random quantities (that depend on future and/or past observations) entails nonstandard metrizations of the parameter space Θ. This is the essence of the compatibility Equations (43) and (44). Of course, all the lines of research proposed in this paper deserve much more attention, in order to produce new results of wider validity.

The first issue deals with the extension of the compatibility equations to higher dimensions, including the infinite-dimensional case. For finite dimensions, this is only a technical fact. Indeed, the question relies on extending the asymptotic expansion given in Proposition 3 from dimension 1 to dimension $d > 1$. This is done in [39] as far as the Bayesian setting is concerned, and in [53,54] for a general mathematical setting. See also [55] [Section 2.2]. For the infinite dimension, the mathematical literature is rather scant. Some interesting results on asymptotic expansions of Laplace type for separable Hilbert spaces with Gaussian measure are contained in [56]. Finally, the topic is still in its early stage as far as metric measure spaces (i.e., the full nonparametric setting) are concerned. See [57,58].

Another mathematical tool that proves to be critical to our study is the Wasserstein distance. As explained in specific monographs like [27,59], the Wasserstein distance has several connections with other fields of mathematical analysis, such as optimal transportation and the theory of PDEs. Actually, the achievement of some estimators within our theory (like the one in (55)) is tightly connected with some optimization issues in transport theory. In this respect, an interesting mathematical area to explore is represented by the theory of Wasserstein barycenters and the ensuing numerical algorithms. See [60]. Research on this is ongoing.

Then, all the extensions of de Finetti's Law of Large Numbers for the log-likelihood process, stated in Theorem 1, Proposition 1 and Lemma 1 in Section 2.2, are worth reconsidering, independently of their use for the purposes of this paper. As to possible extensions, the first hint is concerned with the analysis of dominated, parametric non-regular models, such as those considered in [61–63]. Here, in fact, we never used the properties of the MLE as the root of the gradient of the log-likelihood, so that the asymptotic results contained in the quoted works should be enough to extend our statements. Subsequently, it would also be very interesting to consider dominated models which are parametrized by infinite-dimensional objects, where typically the MLE does not exist. See, e.g., the recent book [64] for plenty of examples.

As to more statistical objectives, it would be interesting to further deepen the connection between our approach and some relevant achievements obtained within the empirical Bayes theory, such as those contained in [22,23,65–68]. See also the book [69] for plenty of applications. In particular, the discussion contained in Section 4.4 about the original Poisson-mixture setting considered by Herbert Robbins deserves more attention.

A very fertile area of the application of predictive inference is that of *species sampling problems*. The pioneering works on this topic can be identified with the works [66,67,70]. Nowadays, the Bayesian approach (especially of nonparametric type) has received much attention, and has produced noteworthy new results in this field. See [17,71–73] and also [55,74,75] for novel asymptotic results. Indeed, it would be interesting to investigate whether it is possible to derive, within the approach of this paper, both asymptotic results and new estimators, hopefully more competitive than the existing ones.

Another prolific field of application is that of *density estimation*, aimed at solving clustering and/or classification problems. See [76] for a Bayesian perspective. Here, there is an additional technical difficulty due to the fact that the parameter is an element of some infinite-dimensional manifold, so that the characterization of any metric on Θ will prove mathematically more complex.

A last mention is devoted to predictive problems with "compressed data". This kind of research comes directly from computer science, where the complexity of the observed data makes the available sample essentially useless for statistical inference purposes. For this reason, many algorithms have been conceived to compress the information in order to make it useful in some sense. See, e.g., [77]. Here, the Bayesian approach is in its early stage (see [78]), and the results of this paper can provide a valuable contribution.

**Funding:** This research received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No 817257.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** I wish to express my enormous gratitude and admiration to Eugenio Regazzini. He has represented for me a constant source of inspiration, transmitting enthusiasm and method for the development of my own research. This paper represents a small present for his 75th birthday.

**Conflicts of Interest:** The author declares no conflict of interest.

## **References**

