*Article* **On Johnson's "Sufficientness" Postulates for Feature-Sampling Models**

**Federico Camerlenghi 1,2,3,∗ and Stefano Favaro 2,4,5**


**Abstract:** In the 1920s, the English philosopher W.E. Johnson introduced a characterization of the symmetric Dirichlet prior distribution in terms of its predictive distribution. This is typically referred to as Johnson's "sufficientness" postulate, and it has been the subject of many contributions in Bayesian statistics, leading to predictive characterization for infinite-dimensional generalizations of the Dirichlet distribution, i.e., species-sampling models. In this paper, we review "sufficientness" postulates for species-sampling models, and then investigate analogous predictive characterizations for the more general feature-sampling models. In particular, we present a "sufficientness" postulate for a class of feature-sampling models referred to as Scaled Processes (SPs), and then discuss analogous characterizations in the general setup of feature-sampling models.

**Keywords:** Bayesian nonparametrics; exchangeability; feature-sampling model; de Finetti theorem; Johnson's "sufficientness" postulate; predictive distribution; scaled process prior; species-sampling model

## **1. Introduction**

Exchangeability (de Finetti [1]) provides a natural modeling assumption in a large variety of statistical problems, and it amounts to the assumption that the order in which observations are recorded is not relevant. Consider a sequence of random variables $(Z\_j)\_{j \ge 1}$ defined on a common probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and taking values in an arbitrary space, which is assumed to be Polish. The sequence $(Z\_j)\_{j \ge 1}$ is exchangeable if and only if

$$(Z\_1, \ldots, Z\_n) \stackrel{\text{d}}{=} (Z\_{\sigma(1)}, \ldots, Z\_{\sigma(n)})$$

for any permutation $\sigma$ of the set $\{1, \ldots, n\}$ and any $n \ge 1$. By virtue of the celebrated de Finetti representation theorem, exchangeability of $(Z\_j)\_{j \ge 1}$ is tantamount to asserting the existence of a random element $\tilde{\mu}$, defined on a (parameter) space $\Theta$, such that, conditionally on $\tilde{\mu}$, the $Z\_j$s are independent and identically distributed with common distribution $p\_{\tilde{\mu}}$, i.e.,

$$\begin{aligned} Z\_j \mid \tilde{\mu} &\stackrel{\text{iid}}{\sim} p\_{\tilde{\mu}}, \quad j \ge 1 \\ \tilde{\mu} &\sim \mathcal{M}, \end{aligned} \tag{1}$$

where $\mathcal{M}$ is the distribution of $\tilde{\mu}$. In a Bayesian setting, $\mathcal{M}$ takes on the interpretation of a prior distribution for the parameter object of interest. In this sense, the de Finetti representation theorem is a natural framework for Bayesian statistics. For mathematical convenience,

**Citation:** Camerlenghi, F.; Favaro, S. On Johnson's "Sufficientness" Postulates for Feature-Sampling Models. *Mathematics* **2021**, *9*, 2891. https://doi.org/10.3390/math9222891

Academic Editors: Emanuele Dolera and Federico Bassetti

Received: 10 October 2021 Accepted: 10 November 2021 Published: 13 November 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

$\Theta$ is assumed to be a Polish space, equipped with the Borel $\sigma$-algebra $\mathcal{B}(\Theta)$. Hereafter, with the term *parameter*, we refer to both a finite- and an infinite-dimensional object.
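The hierarchy (1) is straightforward to simulate. The following minimal sketch uses illustrative choices not taken from the paper (a Beta(2, 2) prior for $\mathcal{M}$ and a Bernoulli likelihood; all function names are ours): the latent parameter is drawn once, and the observations are then sampled i.i.d. given it, which is exactly the structure that makes the resulting sequence exchangeable.

```python
import random

def sample_exchangeable(n, seed=0):
    """Sample Z_1,...,Z_n from the hierarchy (1): draw the latent parameter
    mu once from the prior M, then draw the Z_j i.i.d. given mu.
    Illustrative choices: M = Beta(2, 2), p_mu = Bernoulli(mu)."""
    rng = random.Random(seed)
    mu = rng.betavariate(2.0, 2.0)  # mu ~ M (the de Finetti measure)
    # Z_j | mu iid ~ Bernoulli(mu)
    return [1 if rng.random() < mu else 0 for _ in range(n)]

z = sample_exchangeable(10)
```

Permuting the entries of `z` leaves its joint law unchanged, since the draws are conditionally i.i.d. given the single latent `mu`.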

Within the framework of exchangeability (1), a critical role is played by the predictive distributions, namely, the conditional distributions of the $(n+1)$th observation $Z\_{n+1}$ given $\mathbf{Z}\_n := (Z\_1, \ldots, Z\_n)$. The problem of characterizing prior distributions $\mathcal{M}$ in terms of their predictive distributions has a long history in Bayesian statistics, starting from the seminal work of the English philosopher Johnson [2], who provided a predictive characterization of the symmetric Dirichlet prior distribution. Such a characterization is typically referred to as Johnson's "sufficientness" postulate. Species-sampling models (Pitman [3]) provide arguably the most popular infinite-dimensional generalization of the Dirichlet distribution. They form a broad class of nonparametric prior models that correspond to the assumption that $p\_{\tilde{\mu}}$ in (1) is an almost surely discrete random probability measure

$$\tilde{p} = \sum\_{i \ge 1} \tilde{p}\_i\, \delta\_{\tilde{z}\_i}, \tag{2}$$

where: (i) $(\tilde{p}\_i)\_{i \ge 1}$ are non-negative random weights almost surely summing up to 1; (ii) $(\tilde{z}\_i)\_{i \ge 1}$ are random species' labels, independent of $(\tilde{p}\_i)\_{i \ge 1}$, and i.i.d. with common (non-atomic) distribution $P$. The term *species* refers to the fact that the law of $\tilde{p}$ is a prior distribution for the unknown species composition $(\tilde{p}\_i)\_{i \ge 1}$ of a population of individuals $Z\_j$s, with $Z\_j$ belonging to species $\tilde{z}\_i$ with probability $\tilde{p}\_i$ for $j, i \ge 1$. In the context of species-sampling models, Regazzini [4] and Lo [5] provided a "sufficientness" postulate for the Dirichlet process (Ferguson [6]). Such a characterization was then extended by Zabell [7] to the Pitman–Yor process (Perman et al. [8], Pitman and Yor [9]) and by Bacallado et al. [10] to the more general Gibbs-type prior models (Gnedin and Pitman [11]).
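A species-sampling model of the form (2) can be simulated by truncation. The sketch below is illustrative and not from the paper: it generates the random weights $(\tilde{p}\_i)\_{i \ge 1}$ via the standard stick-breaking representation of the Dirichlet process, attaches i.i.d. labels from a diffuse $P$, and then samples observations given the realized measure.

```python
import random

def stick_breaking_weights(theta, trunc, rng):
    """Truncated stick-breaking weights for the Dirichlet process:
    V_i ~ Beta(1, theta), p_i = V_i * prod_{l<i} (1 - V_l)."""
    weights, stick = [], 1.0
    for _ in range(trunc):
        v = rng.betavariate(1.0, theta)
        weights.append(stick * v)
        stick *= 1.0 - v
    return weights

rng = random.Random(1)
p = stick_breaking_weights(theta=2.0, trunc=500, rng=rng)
labels = [rng.random() for _ in p]              # species labels z_i iid from a diffuse P
sample = rng.choices(labels, weights=p, k=20)   # Z_j | p iid ~ p (up to truncation)
```

With 500 sticks, the truncation error $1 - \sum\_i p\_i$ is negligible for moderate $\theta$.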

In this paper, we introduce and discuss Johnson's "sufficientness" postulates in the feature-sampling setting, which generalizes the species-sampling setting by allowing each individual of the population to belong to multiple species, now called features. We point out that feature-sampling models are extremely important in different areas of application; see, e.g., Griffiths and Ghahramani [12], Ayed et al. [13] and the references therein. Under the framework of exchangeability (1), the feature-sampling setting assumes that

$$Z\_j \mid \tilde{\mu} = \sum\_{i \ge 1} A\_{j,i}\, \delta\_{\tilde{w}\_i} \sim p\_{\tilde{\mu}} \tag{3}$$

and

$$\tilde{\mu} = \sum\_{i \ge 1} \tilde{p}\_i\, \delta\_{\tilde{w}\_i},$$

where: (i) conditionally on $\tilde{\mu}$, $(A\_{j,i})\_{i \ge 1}$ are independent Bernoulli random variables with parameters $(\tilde{p}\_i)\_{i \ge 1}$; (ii) $(\tilde{p}\_i)\_{i \ge 1}$ are $(0,1)$-valued random weights; (iii) $(\tilde{w}\_i)\_{i \ge 1}$ are random features' labels, independent of $(\tilde{p}\_i)\_{i \ge 1}$, and i.i.d. with common (non-atomic) distribution $P$. That is, individual $Z\_j$ displays feature $\tilde{w}\_i$ if and only if $A\_{j,i} = 1$, which happens with probability $\tilde{p}\_i$. For example, if, conditionally on $\tilde{\mu}$, $Z\_j$ displays only two features, say $\tilde{w}\_1$ and $\tilde{w}\_5$, then $Z\_j$ equals the random measure $\delta\_{\tilde{w}\_1} + \delta\_{\tilde{w}\_5}$. The distribution $p\_{\tilde{\mu}}$ is the law of a Bernoulli process with parameter $\tilde{\mu}$, which is denoted by $\mathrm{BeP}(\tilde{\mu})$, whereas the law of $\tilde{\mu}$ is a nonparametric prior distribution for the unknown feature probabilities $(\tilde{p}\_i)\_{i \ge 1}$, i.e., a feature-sampling model. Here, we investigate the problem of characterizing prior distributions for $\tilde{\mu}$ in terms of their predictive distributions, with the goal of providing "sufficientness" postulates for feature-sampling models. We discuss such a problem and present partial results for a class of feature-sampling models referred to as Scaled Process (SP) priors for $\tilde{\mu}$ (James et al. [14], Camerlenghi et al. [15]). With these results, we aim at stimulating future research in this field to obtain "sufficientness" postulates for general feature-sampling models.
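The feature-sampling mechanism (3) is easy to simulate once the feature probabilities are fixed. A minimal sketch, with hand-picked illustrative probabilities standing in for a draw of $(\tilde{p}\_i)\_{i \ge 1}$:

```python
import random

def bernoulli_process_sample(n, probs, rng):
    """Draw Z_1,...,Z_n as in (3): the indicators A[j][i] ~ Bernoulli(p_i)
    are independent, and individual j displays feature i iff A[j][i] == 1."""
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(n)]

rng = random.Random(2)
probs = [0.8, 0.5, 0.1, 0.05]   # illustrative feature probabilities (p_i)
A = bernoulli_process_sample(6, probs, rng)
# the set of features displayed by the first individual
features_of_first = [i for i, a in enumerate(A[0]) if a == 1]
```

Unlike the species setting, each row of `A` may contain several 1s: an individual can display many features at once.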

The paper is structured as follows. In Section 2, we present a brief review on Johnson's "sufficientness" postulates for species-sampling models. Section 3 focuses on nonparametric prior models for the Bernoulli process, i.e., feature-sampling models; we review their definitions, properties, and sampling structures. In Section 4, we present a "sufficientness" postulate for SPs. Section 5 concludes the paper by discussing our results and conjecturing analogous results for more general classes of feature-sampling models.

## **2. Species-Sampling Models**

To introduce species-sampling models, we assume that the observations are $\mathbb{Z}$-valued random elements, where $\mathbb{Z}$ is supposed to be a Polish space whose Borel $\sigma$-algebra is denoted by $\mathcal{Z}$. Thus, $\mathbb{Z}$ contains all the possible species' labels of the population. When we deal with species-sampling models, the hierarchical formulation (1) specializes as

$$\begin{aligned} Z\_j \mid \tilde{p} &\stackrel{\text{iid}}{\sim} \tilde{p}, \quad j \ge 1 \\ \tilde{p} &\sim \mathcal{M}, \end{aligned} \tag{4}$$

where $\tilde{p} = \sum\_{i \ge 1} \tilde{p}\_i \delta\_{\tilde{z}\_i}$ is an almost surely discrete random probability measure on $\mathbb{Z}$, and $\mathcal{M}$ denotes its law. We also remind the reader that: (i) $(\tilde{p}\_i)\_{i \ge 1}$ are non-negative random weights almost surely summing up to 1; (ii) $(\tilde{z}\_i)\_{i \ge 1}$ are random species' labels, independent of $(\tilde{p}\_i)\_{i \ge 1}$, and i.i.d. with common (non-atomic) distribution $P$. Using the terminology of Pitman [3], the discrete random probability measure $\tilde{p}$ is a *species-sampling model*. In Bayesian nonparametrics, popular examples of species-sampling models are: the Dirichlet process (Ferguson [6]), the Pitman–Yor process (Perman et al. [8], Pitman and Yor [9]), and the normalized generalized Gamma process (Brix [16], Lijoi et al. [17]). These are examples belonging to a peculiar subclass of species-sampling models, which are referred to as Gibbs-type prior models (Gnedin and Pitman [11], De Blasi et al. [18]). More general subclasses of species-sampling models are, e.g., the homogeneous normalized random measures (Regazzini et al. [19]) and the Poisson–Kingman models (Pitman [20,21]). We refer to Lijoi and Prünster [22] and Ghosal and van der Vaart [23] for a detailed and stimulating account of species-sampling models and their use in Bayesian nonparametrics.

Because of the almost sure discreteness of $\tilde{p}$ in (4), a random sample $\mathbf{Z}\_n := (Z\_1, \ldots, Z\_n)$ from $\tilde{p}$ features ties, that is, $\mathbb{P}(Z\_{j\_1} = Z\_{j\_2}) > 0$ for $j\_1 \neq j\_2$. Thus, $\mathbf{Z}\_n$ induces a random partition of the set $\{1, \ldots, n\}$ into $K\_n = k \le n$ blocks, labeled by $Z\_1^\*, \ldots, Z\_{K\_n}^\*$, with corresponding frequencies $(N\_{n,1}, \ldots, N\_{n,K\_n}) = (n\_1, \ldots, n\_k)$, such that $N\_{n,i} \ge 1$ and $\sum\_{1 \le i \le K\_n} N\_{n,i} = n$. From Pitman [3], the predictive distribution of $\tilde{p}$ is of the form

$$\mathbb{P}(Z\_{n+1}\in A|\mathbf{Z}\_n) = g(n, k, \mathbf{n})\, P(A) + \sum\_{i=1}^k f\_i(n, k, \mathbf{n})\, \delta\_{Z\_i^\*}(A), \quad A \in \mathcal{Z}, \tag{5}$$

for any $n \ge 1$, having set $\mathbf{n} = (n\_1, \ldots, n\_k)$, with $g$ and $f\_i$ being arbitrary non-negative functions that satisfy the constraint $g(n, k, \mathbf{n}) + \sum\_{i=1}^k f\_i(n, k, \mathbf{n}) = 1$. The predictive distribution (5) admits the following interpretation: (i) $g(n, k, \mathbf{n})$ corresponds to the probability that $Z\_{n+1}$ is a new species, that is, a species not observed in $\mathbf{Z}\_n$; (ii) $f\_i(n, k, \mathbf{n})$ corresponds to the probability that $Z\_{n+1}$ is the species $Z\_i^\*$ in $\mathbf{Z}\_n$. The functions $g$ and $f\_i$ completely determine the distribution of the exchangeable sequence $(Z\_j)\_{j \ge 1}$ and, in turn, the distribution of the random partition of $\mathbb{N}$ induced by $(Z\_j)\_{j \ge 1}$. Predictive distributions of popular species-sampling models, e.g., the Dirichlet process, the Pitman–Yor process, and the normalized generalized Gamma process, are of the form (5) for suitable specifications of the functions $g$ and $f\_i$. We refer to Pitman [21] for a detailed account of random partitions induced by species-sampling models and generalizations thereof.
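The predictive scheme (5) can be run forward sequentially. The sketch below (illustrative code, names ours) specializes it to the Dirichlet process with mass parameter $\theta$, for which $g(n, k, \mathbf{n}) = \theta/(\theta + n)$ and $f\_i(n, k, \mathbf{n}) = n\_i/(\theta + n)$, i.e., the classical Chinese restaurant scheme:

```python
import random

def sample_partition(n, theta, seed=0):
    """Sequentially apply (5) with g = theta/(theta+m) and f_i = n_i/(theta+m)
    (Dirichlet process); returns the block frequencies (n_1,...,n_k)."""
    rng = random.Random(seed)
    counts = []                        # n_i: frequency of species Z*_i
    for m in range(n):                 # m individuals observed so far
        if rng.random() < theta / (theta + m):
            counts.append(1)           # new species, drawn from the diffuse P
        else:
            i = rng.choices(range(len(counts)), weights=counts)[0]
            counts[i] += 1             # old species Z*_i, chosen w.p. n_i/(theta+m)
    return counts

freq = sample_partition(100, theta=1.0)
```

Note that $g$ here depends on $\mathbf{Z}\_n$ only through $n$, which is precisely the property singled out by the Regazzini–Lo "sufficientness" postulate discussed below.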

Here, we recall the predictive distribution of Gibbs-type prior models (Gnedin and Pitman [11], De Blasi et al. [18]). Let us first introduce the definition of these processes.

**Definition 1.** *Let <sup>σ</sup>* <sup>∈</sup> (−∞, 1) *and let <sup>P</sup> be a (non-atomic) distribution on* (Z, <sup>Z</sup> )*. A Gibbs-type prior model is a species-sampling model with a predictive distribution of the form*

$$\mathbb{P}(Z\_{n+1}\in A|\mathbf{Z}\_n) = \frac{V\_{n+1,k+1}}{V\_{n,k}} P(A) + \frac{V\_{n+1,k}}{V\_{n,k}} \sum\_{i=1}^{k} (n\_i - \sigma)\, \delta\_{Z\_i^\*}(A), \quad A \in \mathcal{Z}, \tag{6}$$

*for any $n \ge 1$, where $\{V\_{n,k} : n \ge 1,\, 1 \le k \le n\}$ is a collection of non-negative weights that satisfy the recurrence relation $V\_{n,k} = (n - \sigma k) V\_{n+1,k} + V\_{n+1,k+1}$ for all $k = 1, \ldots, n$, $n \ge 1$, with the proviso $V\_{1,1} = 1$.*

Note that the Dirichlet process is a Gibbs-type prior model that corresponds to

$$V\_{n,k} = \frac{\theta^k}{(\theta)\_n}$$

for *θ* > 0, where we have denoted by (*a*)*<sup>b</sup>* = Γ(*a* + *b*)/Γ(*a*) the Pochhammer symbol for the rising factorials. Moreover, the Pitman–Yor process is a Gibbs-type prior model corresponding to

$$V\_{n,k} = \frac{\prod\_{i=0}^{k-1} (\theta + i\sigma)}{(\theta)\_n}$$

for $\sigma \in (0, 1)$ and $\theta > -\sigma$. We refer to Pitman [20] for other examples of Gibbs-type prior models and for a detailed account of the $V\_{n,k}$s; see also Pitman [21] and the references therein.
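The recurrence $V\_{n,k} = (n - \sigma k)V\_{n+1,k} + V\_{n+1,k+1}$ can be verified numerically for both displayed families of weights; the short check below is illustrative (names ours):

```python
from math import gamma

def poch(a, b):
    """Pochhammer symbol (a)_b = Gamma(a + b) / Gamma(a)."""
    return gamma(a + b) / gamma(a)

def V_dirichlet(n, k, theta):
    return theta ** k / poch(theta, n)          # Dirichlet process (sigma = 0)

def V_pitman_yor(n, k, theta, sigma):
    num = 1.0
    for i in range(k):
        num *= theta + i * sigma                # prod_{i=0}^{k-1} (theta + i*sigma)
    return num / poch(theta, n)

def check_recursion(V, sigma, n, k):
    """Check V_{n,k} = (n - sigma*k) V_{n+1,k} + V_{n+1,k+1}."""
    lhs = V(n, k)
    rhs = (n - sigma * k) * V(n + 1, k) + V(n + 1, k + 1)
    return abs(lhs - rhs) < 1e-10 * abs(lhs)

theta, sigma = 1.5, 0.3
ok_dp = check_recursion(lambda n, k: V_dirichlet(n, k, theta), 0.0, n=6, k=3)
ok_py = check_recursion(lambda n, k: V_pitman_yor(n, k, theta, sigma), sigma, n=6, k=3)
```

Both checks succeed because $(\theta)\_{n+1} = (\theta)\_n (\theta + n)$, so the two terms on the right-hand side collapse to $V\_{n,k}$.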

Because of de Finetti's representation theorem, there exists a one-to-one correspondence between the functions $g$ and $f\_i$ in the predictive distribution (5) and the law $\mathcal{M}$ of $\tilde{p}$, i.e., the de Finetti measure. This is at the basis of Johnson's "sufficientness" postulates, characterizing species-sampling models through their predictive distributions. Regazzini [4] and, later, Lo [5] provided the first "sufficientness" postulate for species-sampling models, showing that the Dirichlet process is the unique species-sampling model for which the function $g$ depends on $\mathbf{Z}\_n$ only through $n$, and the function $f\_i$ depends on $\mathbf{Z}\_n$ only through $n$ and $n\_i$ for $i \ge 1$. Such a result was extended in Zabell [24], providing the following "sufficientness" postulate for the Pitman–Yor process: the Pitman–Yor process is the unique species-sampling model for which the function $g$ depends on $\mathbf{Z}\_n$ only through $n$ and $k$, and the function $f\_i$ depends on $\mathbf{Z}\_n$ only through $n$ and $n\_i$ for $i \ge 1$. Bacallado et al. [10] discussed the "sufficientness" postulate in the more general setting of Gibbs-type prior models, showing that Gibbs-type prior models are the sole species-sampling models for which the function $g$ depends on $\mathbf{Z}\_n$ only through $n$ and $k$, and the function $f\_i$ depends on $\mathbf{Z}\_n$ only through $n$, $k$, and $n\_i$. This result highlights a critical difference, at the sampling level, between the Pitman–Yor process and general Gibbs-type prior models: the latter include the sampling information on the observed number of distinct species in the probability of observing, at the $(n+1)$th draw, a species already observed in the sample.

## **3. Feature-Sampling Models**

Feature-sampling models generalize species-sampling models by allowing each individual to belong to more than one species, which are now called features. To introduce feature-sampling models, we consider a space of features $\mathbb{W}$, which is assumed to be a Polish space, and we denote by $\mathcal{W}$ its Borel $\sigma$-field. Thus, $\mathbb{W}$ contains all the possible features' labels of the population. Observations are represented through the counting measure (3), whose parameter $\tilde{\mu}$ is an almost surely discrete measure with masses in $(0,1)$. When we deal with feature-sampling models, the hierarchical formulation (1) specializes as

$$\begin{aligned} Z\_j \mid \tilde{\mu} &\stackrel{\text{iid}}{\sim} \mathrm{BeP}(\tilde{\mu}), \quad j \ge 1 \\ \tilde{\mu} &\sim \mathcal{M}, \end{aligned} \tag{7}$$

where $\tilde{\mu} = \sum\_{i \ge 1} \tilde{p}\_i \delta\_{\tilde{w}\_i}$ is an almost surely discrete random measure on $\mathbb{W}$, and $\mathcal{M}$ denotes its law. We also remind the reader that: (i) conditionally on $\tilde{\mu}$, $(A\_{j,i})\_{i \ge 1}$ are independent Bernoulli random variables with parameters $(\tilde{p}\_i)\_{i \ge 1}$; (ii) $(\tilde{p}\_i)\_{i \ge 1}$ are $(0,1)$-valued random weights; (iii) $(\tilde{w}\_i)\_{i \ge 1}$ are random features' labels, independent of $(\tilde{p}\_i)\_{i \ge 1}$, and i.i.d. with common (non-atomic) distribution $P$. Completely random measures (CRMs) (Daley and Vere-Jones [25], Kingman [26]) provide a popular class of nonparametric priors $\mathcal{M}$, the most common examples of which are the Beta process prior and the stable Beta process prior (Teh and Gorur [27], James [28]); see also Broderick et al. [29] and the references therein for other examples of CRM priors and generalizations thereof. Recently, Camerlenghi et al. [15] investigated an alternative class of nonparametric priors $\mathcal{M}$, generalizing CRM priors and referred to as Scaled Processes (SPs). SP priors first appeared in the work of James [28].

We assume a random sample $\mathbf{Z}\_n := (Z\_1, \ldots, Z\_n)$ to be modeled as in (7), and we introduce the predictive distribution of $\tilde{\mu}$, that is, the conditional distribution of $Z\_{n+1}$ given $\mathbf{Z}\_n$. Note that, because of the pure discreteness of $\tilde{\mu}$, the observations $\mathbf{Z}\_n$ may share a random number of distinct features, say $K\_n = k$, denoted here by $W\_1^\*, \ldots, W\_{K\_n}^\*$, and each feature $W\_i^\*$ is displayed by exactly $M\_{n,i} = m\_i$ of the $n$ individuals, for $i = 1, \ldots, k$. Since the features' labels are immaterial and i.i.d. from the base measure $P$, the conditional distribution of $Z\_{n+1}$, given $\mathbf{Z}\_n$, may be equivalently characterized through the vector $(Y\_{n+1}, A\_{n+1,1}^\*, \ldots, A\_{n+1,K\_n}^\*)$, where: (i) $Y\_{n+1}$ is the number of new features displayed by the $(n+1)$th individual, namely, features hitherto unobserved in the sample $\mathbf{Z}\_n$; (ii) $A\_{n+1,i}^\*$ is a $\{0,1\}$-valued random variable for any $i = 1, \ldots, K\_n$, with $A\_{n+1,i}^\* = 1$ if the $(n+1)$th individual displays feature $W\_i^\*$, and $A\_{n+1,i}^\* = 0$ otherwise. Hence, the predictive distribution of $\tilde{\mu}$ is

$$\mathbb{P}((Y\_{n+1}, A\_{n+1,1}^\*, \ldots, A\_{n+1,K\_n}^\*) = (y, a\_1, \ldots, a\_{K\_n}) | \mathbf{Z}\_n) = f(y, a\_1, \ldots, a\_k; n, k, \mathbf{m}) \tag{8}$$

where we denote by $f$ a probability distribution evaluated at $(y, a\_1, \ldots, a\_k)$, and where $n$, $k$, and $\mathbf{m} := (m\_1, \ldots, m\_k)$ constitute the sampling information. In the rest of this section, we specify the function $f$ under the assumption of a CRM prior and of an SP prior, showing its dependence on $n$, $K\_n$, and $(M\_{n,1}, \ldots, M\_{n,K\_n})$. In particular, we show how SP priors allow one to enrich the predictive distribution of CRM priors by including additional sampling information in terms of the number of distinct features and their corresponding frequencies.
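To make the predictive object (8) concrete, the sketch below runs forward its best-known special case, the Indian Buffet Process of Griffiths and Ghahramani [12]: the $(n+1)$th individual displays each previously seen feature $W\_i^\*$ with probability $m\_i/(n+1)$ and a Poisson$(\alpha/(n+1))$ number of new features (illustrative code, names ours):

```python
import math, random

def sample_poisson(lam, rng):
    """Poisson sampler by inversion of the CDF (adequate for small lam)."""
    u, k = rng.random(), 0
    p = math.exp(-lam)   # P(N = 0)
    c = p
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

def indian_buffet(n, alpha, seed=0):
    """Sequential feature allocation: the (j+1)th individual keeps feature i
    w.p. m_i/(j+1) and adds Poisson(alpha/(j+1)) new features; returns the
    feature frequencies m_i after n individuals."""
    rng = random.Random(seed)
    m = []
    for j in range(n):                  # j individuals already observed
        for i in range(len(m)):
            if rng.random() < m[i] / (j + 1):
                m[i] += 1
        m += [1] * sample_poisson(alpha / (j + 1), rng)
    return m

counts = indian_buffet(50, alpha=3.0)
```

Both the Poisson rate and the retention probabilities here depend on the sampling information only through $n$ and the $m\_i$s, which is the predictive behavior of CRM priors characterized in Theorem 1 below.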

## *3.1. Priors Based on CRMs*

Let $\mathbb{M}\_{\mathbb{W}}$ denote the space of all boundedly finite measures on $(\mathbb{W}, \mathcal{W})$, that is to say, $\mu \in \mathbb{M}\_{\mathbb{W}}$ iff $\mu(A) < +\infty$ for any bounded set $A \in \mathcal{W}$. Here, we recall the definition of a Completely Random Measure (CRM) (see, e.g., Daley and Vere-Jones [25]).

**Definition 2.** *A Completely Random Measure (CRM) μ*˜ *on* (W, W ) *is a random element taking values in the space* M<sup>W</sup> *such that the random variables μ*˜(*A*1), ... , *μ*˜(*An*) *are independent for any choice of bounded and disjoint sets A*1,..., *An* ∈ W *and for any n* ≥ 1*.*

We remind the reader that Kingman [26] proved that a CRM may be decomposed as the sum of a deterministic drift and a purely atomic component. In Bayesian nonparametrics, it is common to consider purely atomic CRMs without fixed points of discontinuity, that is to say, $\tilde{\mu}$ may be represented as $\tilde{\mu} := \sum\_{i \ge 1} \tilde{\eta}\_i \delta\_{\tilde{w}\_i}$, where $(\tilde{\eta}\_i)\_{i \ge 1}$ is a sequence of random jumps and $(\tilde{w}\_i)\_{i \ge 1}$ are the random locations. An appealing property of purely atomic CRMs is the availability of their Laplace functional; indeed, for any measurable function $f : \mathbb{W} \to \mathbb{R}^+$, one has

$$\mathbb{E}\left[e^{-\int\_{\mathbb{W}} f(w)\, \tilde{\mu}(\mathrm{d}w)}\right] = \exp\left\{-\int\_{\mathbb{W}\times\mathbb{R}^+} (1-e^{-s f(w)})\, \nu(\mathrm{d}w, \mathrm{d}s)\right\} \tag{9}$$

where *<sup>ν</sup>* is a measure on <sup>W</sup> <sup>×</sup> <sup>R</sup><sup>+</sup> called the Lévy intensity of the CRM *<sup>μ</sup>*˜, and it is such that

$$\nu(\{w\} \times \mathbb{R}^+) = 0 \quad \forall w \in \mathbb{W}, \qquad \int\_{A \times \mathbb{R}^+} \min\{s, 1\}\, \nu(\mathrm{d}w, \mathrm{d}s) < \infty \tag{10}$$

for any bounded Borel set *A*. Here, we focus on homogeneous CRMs by assuming that the atoms *η*˜*i*s and the locations *w*˜*i*s are independent; in this case, the Lévy measure may be written as

$$\nu(\mathrm{d}w, \mathrm{d}s) = \lambda(s)\,\mathrm{d}s\, P(\mathrm{d}w)$$

for some measurable function *<sup>λ</sup>* : <sup>R</sup><sup>+</sup> <sup>→</sup> <sup>R</sup><sup>+</sup> and a probability measure *<sup>P</sup>* on (W, <sup>W</sup> ), called the *base measure*, which is assumed to be diffuse. In this case, the distribution of *μ*˜ will be denoted as CRM(*λ*; *P*), and the second integrability condition in (10) reduces to the following:

$$\int\_{\mathbb{R}^+} \min\{s, 1\} \lambda(s) \mathrm{d}s < +\infty. \tag{11}$$
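The Laplace functional (9) can be checked by Monte Carlo for a finite-activity intensity. Assuming, purely for illustration, $\lambda(s) = e^{-s}$ on $\mathbb{R}^+$ (so the number of jumps is Poisson(1) and each jump is Exp(1)) and $f \equiv t$ constant, the right-hand side of (9) reduces to $\exp\{-t/(1+t)\}$:

```python
import math, random

def mc_laplace(t, n_sim=100_000, seed=3):
    """Monte Carlo estimate of E[exp(-t * mu(W))] for a homogeneous CRM with
    finite-activity intensity lambda(s) = exp(-s): the jump count is
    Poisson(1), each jump is Exp(1), and mu(W) is the sum of the jumps."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_sim):
        # draw N ~ Poisson(1) by inversion of the CDF
        n_jumps, u = 0, rng.random()
        p = math.exp(-1.0)
        c = p
        while u > c:
            n_jumps += 1
            p /= n_jumps
            c += p
        total = sum(rng.expovariate(1.0) for _ in range(n_jumps))
        acc += math.exp(-t * total)
    return acc / n_sim

t = 1.0
estimate = mc_laplace(t)
exact = math.exp(-t / (1.0 + t))   # value of the right-hand side of (9)
```

The agreement (up to Monte Carlo error) illustrates why the Laplace functional is the main computational tool in the proofs of Theorems 1 and 2.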

In the feature-sampling framework, $\tilde{\mu}$ may be used as a prior distribution if the atoms $(\tilde{\eta}\_i)\_{i \ge 1}$ take values in $[0, 1]$, which happens if the Lévy intensity is supported on $\mathbb{W} \times [0, 1]$. A noteworthy example, widely used in this setting, is the stable Beta process prior (Teh and Gorur [27]). It is defined as a CRM with Lévy intensity

$$\lambda(s) = \alpha\, \frac{\Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)}\, s^{-1-\sigma} (1-s)^{c+\sigma-1}\, \mathbb{1}\_{(0,1)}(s) \tag{12}$$

where *c* > 0, *σ* ∈ (0, 1), and *α* > 0 (James [28], Masoero et al. [30]). Now, we describe the predictive distribution for an arbitrary CRM *μ*˜. For the sake of clarity, we fix the following notation:

$$\text{Poisson}(y; C) := \frac{C^y e^{-C}}{y!}, \quad y \in \mathbb{N}\_0, \qquad \text{Bern}(a; p) := p^a (1-p)^{1-a}, \quad a \in \{0, 1\}$$

to denote the probability mass functions of a Poisson with parameter *C* > 0 and a Bernoulli random variable with parameter *p* ∈ [0, 1], respectively. We refer to James [28] for a detailed posterior analysis of CRM priors; see also Broderick et al. [29] and the references therein.

**Theorem 1** (James [28])**.** *Let $Z\_1, Z\_2, \ldots$ be exchangeable random variables modeled as in (7), where $\mathcal{M}$ equals $\mathrm{CRM}(\lambda; P)$. If $\mathbf{Z}\_n$ is a random sample that displays $K\_n = k$ distinct features $\{W\_1^\*, \ldots, W\_{K\_n}^\*\}$, and feature $W\_i^\*$ appears exactly $M\_{n,i} = m\_i$ times in the sample, for $i = 1, \ldots, K\_n$, then*

$$\begin{split} \mathbb{P}((Y\_{n+1}, A\_{n+1,1}^\*, \ldots, A\_{n+1,K\_n}^\*) &= (y, a\_1, \ldots, a\_{K\_n}) | \mathbf{Z}\_n) \\ &= \text{Poisson}\left(y; \int\_0^1 s(1-s)^n \lambda(s)\,\mathrm{d}s\right) \prod\_{i=1}^{k} \text{Bern}(a\_i; p\_i^\*) \end{split} \tag{13}$$

*being*

$$p\_i^\* := \frac{\int\_0^1 s^{m\_i+1} (1-s)^{n-m\_i} \lambda(s)\,\mathrm{d}s}{\int\_0^1 s^{m\_i} (1-s)^{n-m\_i} \lambda(s)\,\mathrm{d}s}.$$

**Proof.** By James [28] (Proposition 3.2) for Bernoulli product models (see also Camerlenghi et al. [15] (Proposition 1)), the distribution of $Z\_{n+1}$, given $\mathbf{Z}\_n$, equals the distribution of

$$Z\_{n+1}' + \sum\_{i=1}^{K\_n} A\_{n+1,i}^\*\, \delta\_{W\_i^\*} \tag{14}$$

where $Z\_{n+1}' | \tilde{\mu}' = \sum\_{i \ge 1} A\_{n+1,i}'\, \delta\_{\tilde{w}\_i'} \sim \mathrm{BeP}(\tilde{\mu}')$ with $\tilde{\mu}' \sim \mathrm{CRM}((1-s)^n \lambda; P)$, and $A\_{n+1,1}^\*, \ldots, A\_{n+1,K\_n}^\*$ are Bernoulli random variables with parameters $J\_1, \ldots, J\_{K\_n}$, respectively, such that each $J\_i$ is a random variable with a density function of the form

$$f\_{J\_i}(s) \propto (1-s)^{n-m\_i} s^{m\_i} \lambda(s).$$

By exploiting the previous predictive characterization, we can derive the posterior distribution of $Y\_{n+1}$ given $\mathbf{Z}\_n$ by means of a direct application of the Laplace functional. Indeed, the distribution of $Y\_{n+1} | \mathbf{Z}\_n$ equals that of $\sum\_{i \ge 1} A\_{n+1,i}'$. Thus, for any $t \ge 0$, we have the following:

$$\begin{aligned} \mathbb{E}[e^{-t Y\_{n+1}}|\mathbf{Z}\_n] &= \mathbb{E}\big[e^{-t \sum\_{i \ge 1} A\_{n+1,i}'}\big] = \mathbb{E}\Big[\prod\_{i \ge 1} e^{-t A\_{n+1,i}'}\Big] = \mathbb{E}\Big[\mathbb{E}\Big[\prod\_{i \ge 1} e^{-t A\_{n+1,i}'} \,\Big|\, \tilde{\mu}'\Big]\Big] \\ &= \mathbb{E}\Big[\prod\_{i \ge 1} \big(e^{-t} \tilde{\eta}\_i' + (1 - \tilde{\eta}\_i')\big)\Big], \end{aligned}$$

where we used the representation $\tilde{\mu}' = \sum\_{i \ge 1} \tilde{\eta}\_i'\, \delta\_{\tilde{w}\_i'}$ and the fact that the $A\_{n+1,i}'$s are independent Bernoulli random variables conditionally on $\tilde{\mu}'$. We now use the Laplace functional of $\tilde{\mu}'$ to get

$$\begin{aligned} \mathbb{E}[e^{-t Y\_{n+1}}|\mathbf{Z}\_n] &= \mathbb{E}\left[\exp\left\{\sum\_{i \ge 1} \log\big(1 + \tilde{\eta}\_i' (e^{-t} - 1)\big)\right\}\right] \\ &= \exp\left\{-(1 - e^{-t}) \int\_0^1 (1-s)^n s\, \lambda(s)\,\mathrm{d}s\right\}. \end{aligned}$$

As a direct consequence, the posterior distribution of $Y\_{n+1}$ given $\mathbf{Z}\_n$ is a Poisson distribution with mean $\int\_0^1 (1-s)^n s\, \lambda(s)\,\mathrm{d}s$. Again, by exploiting the predictive representation (14), the posterior distribution of $A\_{n+1,i}^\*$, for $i = 1, \ldots, K\_n$, is a Bernoulli with the following mean:

$$\mathbb{E}[J\_i] = \int\_0^1 s\, f\_{J\_i}(s)\,\mathrm{d}s = \frac{\int\_0^1 (1-s)^{n-m\_i} s^{m\_i+1} \lambda(s)\,\mathrm{d}s}{\int\_0^1 (1-s)^{n-m\_i} s^{m\_i} \lambda(s)\,\mathrm{d}s}.$$
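As a numerical sanity check on the ratio $p\_i^\*$, take the illustrative finite intensity $\lambda(s) = 1$ on $(0,1)$ (not one of the paper's priors, chosen only because both integrals are then Beta functions): $p\_i^\* = B(m\_i+2, n-m\_i+1)/B(m\_i+1, n-m\_i+1) = (m\_i+1)/(n+2)$, which simple midpoint quadrature reproduces.

```python
def midpoint_quad(f, n_pts=20_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / n_pts
    return h * sum(f((i + 0.5) * h) for i in range(n_pts))

def p_star(m, n, lam):
    """The ratio p_i^* from Theorem 1 for a given intensity lam."""
    num = midpoint_quad(lambda s: s ** (m + 1) * (1 - s) ** (n - m) * lam(s))
    den = midpoint_quad(lambda s: s ** m * (1 - s) ** (n - m) * lam(s))
    return num / den

m, n = 3, 10
approx = p_star(m, n, lam=lambda s: 1.0)   # illustrative lambda(s) = 1
exact = (m + 1) / (n + 2)                  # Beta-function closed form
```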

**Corollary 1.** *Let $Z\_1, Z\_2, \ldots$ be exchangeable random variables modeled as in (7), where $\mathcal{M}$ is the law of the stable Beta process. If $\mathbf{Z}\_n$ is a random sample that displays $K\_n = k$ distinct features $\{W\_1^\*, \ldots, W\_{K\_n}^\*\}$, and feature $W\_i^\*$ appears exactly $M\_{n,i} = m\_i$ times in the sample, for $i = 1, \ldots, K\_n$, then*

$$\begin{split} \mathbb{P}((Y\_{n+1}, A\_{n+1,1}^\*, \ldots, A\_{n+1,K\_n}^\*) &= (y, a\_1, \ldots, a\_{K\_n}) | \mathbf{Z}\_n) \\ &= \text{Poisson}\left(y; \alpha\, \frac{(c+\sigma)\_n}{(c+1)\_n}\right) \prod\_{i=1}^{k} \text{Bern}\left(a\_i; \frac{m\_i - \sigma}{n+c}\right), \end{split} \tag{15}$$

*where* (*x*)*<sup>y</sup>* = Γ(*x* + *y*)/Γ(*x*) *denotes the Pochhammer symbol for x*, *y* > 0*.*

**Proof.** It is sufficient to specialize Theorem 1 for the stable Beta process. In particular, from Theorem 1, the posterior distribution of *Yn*+<sup>1</sup> given *Z<sup>n</sup>* is a Poisson distribution with mean

$$\int\_0^1 s(1-s)^n \lambda(s)\,\mathrm{d}s \stackrel{(12)}{=} \frac{\alpha\, \Gamma(1+c)}{\Gamma(1-\sigma)\Gamma(c+\sigma)} \int\_0^1 s^{-\sigma}(1-s)^{n+c+\sigma-1}\,\mathrm{d}s = \alpha\, \frac{(c+\sigma)\_n}{(c+1)\_n}.$$

Moreover, the parameters of the Bernoulli random variables $A\_{n+1,1}^\*, \ldots, A\_{n+1,K\_n}^\*$ are equal to

$$p\_i^\* = \frac{\int\_0^1 s^{m\_i+1} (1-s)^{n-m\_i} \lambda(s)\,\mathrm{d}s}{\int\_0^1 s^{m\_i} (1-s)^{n-m\_i} \lambda(s)\,\mathrm{d}s} \stackrel{(12)}{=} \frac{B(m\_i+1-\sigma,\, c+\sigma+n-m\_i)}{B(m\_i-\sigma,\, c+\sigma+n-m\_i)} = \frac{m\_i-\sigma}{n+c}$$

for $i = 1, \ldots, K\_n$.
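The closed form $(m\_i - \sigma)/(n + c)$ can also be confirmed by direct quadrature against the stable Beta intensity (12); the multiplicative constant in (12) cancels in the ratio, so the check below drops it (illustrative code, names ours):

```python
def midpoint_quad(f, n_pts=20_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / n_pts
    return h * sum(f((i + 0.5) * h) for i in range(n_pts))

def p_star_stable_beta(m, n, sigma, c):
    """p_i^* for the stable Beta intensity (12), up to the constant that
    cancels between numerator and denominator."""
    lam = lambda s: s ** (-1 - sigma) * (1 - s) ** (c + sigma - 1)
    num = midpoint_quad(lambda s: s ** (m + 1) * (1 - s) ** (n - m) * lam(s))
    den = midpoint_quad(lambda s: s ** m * (1 - s) ** (n - m) * lam(s))
    return num / den

m, n, sigma, c = 4, 10, 0.25, 2.0
approx = p_star_stable_beta(m, n, sigma, c)
exact = (m - sigma) / (n + c)              # closed form from Corollary 1
```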

## *3.2. SP Priors*

From Theorem 1, under CRM priors, the distribution of the number of new features $Y\_{n+1}$ is a Poisson distribution that depends on the sampling information only through the sample size $n$. Moreover, the probability of observing a feature already recorded in the sample, say $W\_i^\*$, depends only on the sample size $n$ and the frequency $m\_i$ of feature $W\_i^\*$ in the initial sample. Camerlenghi et al. [15] showed that SP priors allow one to enrich the predictive structure of CRM priors by including additional sampling information in the probability of discovering new features. To introduce SP priors, consider a CRM $\tilde{\mu} = \sum\_{i \ge 1} \tilde{\tau}\_i \delta\_{\tilde{w}\_i}$ on $\mathbb{W}$, where $(\tilde{\tau}\_i)\_{i \ge 1}$ are positive random jumps and $(\tilde{w}\_i)\_{i \ge 1}$ are i.i.d. random locations, with Lévy intensity $\nu(\mathrm{d}w, \mathrm{d}s) = \lambda(s)\,\mathrm{d}s\, P(\mathrm{d}w)$ satisfying

$$\int\_0^\infty \min\{s, 1\} \lambda(s) \mathrm{ds} < +\infty. \tag{16}$$

Consider the ordered jumps Δ<sup>1</sup> > Δ<sup>2</sup> > ··· of the CRM *μ*˜ and define the random measure

$$\tilde{\mu}\_{\Delta\_1} = \sum\_{i \ge 1} \frac{\Delta\_{i+1}}{\Delta\_1}\, \delta\_{\tilde{w}\_i},$$

obtained by normalizing $\tilde{\mu}$ by its largest jump. The definition of SPs follows via a suitable change of measure for $\Delta\_1$ (James et al. [14], Camerlenghi et al. [15]). Let us denote by $\mathcal{L}(\cdot\,, a)$ a regular version of the conditional probability distribution of $(\Delta\_{i+1}/\Delta\_1)\_{i \ge 1}$ given $\Delta\_1 = a$. Now denote by $\Psi\_1$ a positive random variable with density function $f\_{\Psi\_1}$ on $\mathbb{R}^+$, and define

$$\mathcal{L}(\cdot) := \int\_{\mathbb{R}^+} \mathcal{L}(\cdot\,, a)\, f\_{\Psi\_1}(a)\, \mathrm{d}a.$$

That is, the new distribution $\mathcal{L}$ of $(\Delta\_{i+1}/\Delta\_1)\_{i \ge 1}$ is obtained by mixing $\mathcal{L}(\cdot\,, a)$ with respect to the density function $f\_{\Psi\_1}$. Thus, we are ready to define an SP.

**Definition 3.** *A Scaled Process (SP) prior on* (W, W ) *is defined as the almost surely discrete random measure*

$$\tilde{\mu}\_{\Psi\_1} := \sum\_{i \ge 1} \tilde{\eta}\_i\, \delta\_{\tilde{w}\_i}, \tag{17}$$

*where $(\tilde{\eta}\_i)\_{i \ge 1}$ has distribution $\mathcal{L}$ and $(\tilde{w}\_i)\_{i \ge 1}$ is a sequence of independent random variables with common distribution $P$, also independent of $(\tilde{\eta}\_i)\_{i \ge 1}$. We will write $\tilde{\mu}\_{\Psi\_1} \sim \mathrm{SP}(\nu, f\_{\Psi\_1})$.*

A thorough account, with a complete posterior analysis for SPs, is given in Camerlenghi et al. [15]. Here, we characterize the predictive distribution (8) of SPs.

**Theorem 2** (Camerlenghi et al. [15], James [28])**.** *Let $Z\_1, Z\_2, \ldots$ be exchangeable random variables modeled as in (7), where $\mathcal{M}$ equals $\mathrm{SP}(\nu, f\_{\Psi\_1})$. If $\mathbf{Z}\_n$ is a random sample that displays $K\_n = k$ distinct features $\{W\_1^\*, \ldots, W\_{K\_n}^\*\}$, and feature $W\_i^\*$ appears exactly $M\_{n,i} = m\_i$ times in the sample, for $i = 1, \ldots, K\_n$, then the conditional distribution of $\Psi\_1$, given $\mathbf{Z}\_n$, has posterior density:*

$$f\_{\Psi\_1|\mathbf{Z}\_n}(a) \propto e^{-\sum\_{i=1}^n \int\_0^1 s(1-s)^{i-1} a\lambda(as)\,\mathrm{d}s} \prod\_{i=1}^k \int\_0^1 s^{m\_i} (1-s)^{n-m\_i} a\lambda(as)\,\mathrm{d}s \, f\_{\Psi\_1}(a). \tag{18}$$

*Moreover, conditionally on Z<sup>n</sup> and* Ψ1*,*

$$\begin{split} \mathbb{P}\left( (Y\_{n+1}, A\_{n+1,1}^\*, \dots, A\_{n+1,K\_n}^\*) = (y, a\_1, \dots, a\_{K\_n}) \,|\, \mathbf{Z}\_n, \Psi\_1 \right) \\ = \text{Poisson}\left( y; \int\_0^1 s \Psi\_1(1-s)^n \lambda(s \Psi\_1) \,\mathrm{d}s \right) \prod\_{i=1}^k \text{Bern}(a\_i; p\_i^\*(\Psi\_1)) \end{split} \tag{19}$$

*being*

$$p\_i^\*(\Psi\_1) := \frac{\int\_0^1 s^{m\_i+1} (1-s)^{n-m\_i} \lambda(s\Psi\_1) \mathrm{d}s}{\int\_0^1 s^{m\_i} (1-s)^{n-m\_i} \lambda(s\Psi\_1) \mathrm{d}s}.$$

**Proof.** The representation of the predictive distribution (19) follows from Camerlenghi et al. [15] (Proposition 2). Indeed, the posterior distribution of the largest jump directly follows from [15] (Equation (4)). In addition, the authors of [15] (Proposition 2) showed that the conditional distribution of *Zn*+1, given *Z<sup>n</sup>* and Ψ1, equals the distribution of the following counting measure:

$$Z\_{n+1}' + \sum\_{i=1}^{K\_n} A\_{n+1,i}^\* \delta\_{W\_i^\*} \tag{20}$$

where $Z'\_{n+1} = \sum\_{i \ge 1} A'\_{n+1,i} \delta\_{\tilde{w}'\_i}$, with $Z'\_{n+1} \,|\, \tilde{\mu}'\_{\Psi\_1} \sim \mathrm{BeP}(\tilde{\mu}'\_{\Psi\_1})$, and $\tilde{\mu}'\_{\Psi\_1}$ is a CRM with Lévy intensity of the form

$$\nu\_{\Psi\_1}'(\mathrm{d}w, \mathrm{d}s) = (1-s)^n \Psi\_1 \lambda(\Psi\_1 s) \mathbf{1}\_{(0,1)}(s) \mathrm{d}s P(\mathrm{d}w).$$

Moreover, *A*∗ *<sup>n</sup>*+1,1, ... , *A*<sup>∗</sup> *<sup>n</sup>*+1,*Kn* are Bernoulli random variables with parameters *J*1, ... , *JKn* , respectively, such that conditionally on Ψ1, each *Ji* has a distribution with a density function of the form

$$f\_{J\_i|\Psi\_1}(s) \propto (1-s)^{n-m\_i} s^{m\_i} \Psi\_1 \lambda(\Psi\_1 s) \quad \text{on } (0,1).$$

As in the proof of Theorem 1, one can show that the distribution of $Y\_{n+1} \,|\, (\Psi\_1, \mathbf{Z}\_n)$ equals that of $\sum\_{i \ge 1} A'\_{n+1,i}$. Thus, by evaluating the Laplace functional, one may easily verify that this random sum has a Poisson distribution with mean $\int\_0^1 (1-s)^n s \Psi\_1 \lambda(\Psi\_1 s)\,\mathrm{d}s$. Moreover, by exploiting the posterior representation (20), the variables $A^\*\_{n+1,i}$, for $i = 1, \dots, K\_n$, conditionally on $\mathbf{Z}\_n$ and $\Psi\_1$, are independent and Bernoulli distributed with mean

$$\mathbb{E}[J\_i|\Psi\_1] = \int\_0^1 s f\_{J\_i|\Psi\_1}(s) \,\mathrm{d}s = \frac{\int\_0^1 (1-s)^{n-m\_i} s^{m\_i+1} \Psi\_1 \lambda(s\Psi\_1) \,\mathrm{d}s}{\int\_0^1 (1-s)^{n-m\_i} s^{m\_i} \Psi\_1 \lambda(s\Psi\_1) \,\mathrm{d}s}.$$

**Remark 1.** *According to* (18)*, the conditional distribution of* Ψ<sup>1</sup> *given Z<sup>n</sup> may include the whole sampling information, depending on the specification of ν and f*Ψ<sup>1</sup> *, and hence, the conditional distribution of Yn*+<sup>1</sup> *given Z<sup>n</sup> may also include such sampling information. As a corollary of Theorem 2, the conditional distribution of Yn*+<sup>1</sup> *given Z<sup>n</sup> is a mixture of Poisson distributions that may include the whole sampling information; in particular, the amount of sampling information in the posterior distribution is uniquely determined by the mixing distribution, namely by the conditional distribution of* Ψ1*, given Zn.*

Hereafter, we specialize Theorem 2 to the stable SP, that is, the particular SP defined through a CRM with Lévy intensity *<sup>ν</sup>* such that $\lambda(s) = \sigma s^{-1-\sigma}$ for a parameter $\sigma \in (0, 1)$. We refer to Camerlenghi et al. [15] for a detailed posterior analysis of the stable SP prior.

**Corollary 2.** *Let Z*1, *Z*2, ... *be exchangeable random variables modeled as in* (7)*, where* M *equals* SP(*ν*, *<sup>f</sup>*Ψ<sup>1</sup> )*, with <sup>λ</sup>*(*s*) = *<sup>σ</sup>s*−1−*<sup>σ</sup> for some <sup>σ</sup>* <sup>∈</sup> (0, 1)*. If <sup>Z</sup><sup>n</sup> is a random sample that displays Kn* = *k distinct features* {*W*<sup>∗</sup> <sup>1</sup> , ... , *W*<sup>∗</sup> *Kn* }*, and feature W*<sup>∗</sup> *<sup>i</sup> appears exactly Mn*,*<sup>i</sup>* = *mi times in the sample, for i* = 1, ... , *Kn, then the conditional distribution of* Ψ1*, given Zn, has posterior density:*

$$f\_{\Psi\_1|\mathbf{Z}\_n}(a) \propto a^{-\sigma k} e^{-\sigma a^{-\sigma} \sum\_{i=1}^n B(1-\sigma, i)} f\_{\Psi\_1}(a) \tag{21}$$

*having denoted by B*(· , ·) *the classical Euler Beta function. Moreover, conditionally on Z<sup>n</sup> and* Ψ1*,*

$$\begin{split} \mathbb{P}((Y\_{n+1}, A^\*\_{n+1, 1}, \dots, A^\*\_{n+1, K\_n}) &= (y, a\_1, \dots, a\_{K\_n}) \,|\, \mathbf{Z}\_n, \Psi\_1) \\ &= \text{Poisson}(y; \sigma \Psi\_1^{-\sigma} B(1 - \sigma, n + 1)) \prod\_{i=1}^k \text{Bern} \left( a\_i; \frac{m\_i - \sigma}{n - \sigma + 1} \right) . \end{split} \tag{22}$$

**Proof.** The proof is a plain application of Theorem 2 under the choice $\lambda(s) = \sigma s^{-1-\sigma}$.
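As a sanity check on Corollary 2, the closed-form predictive quantities in (22) can be compared numerically with the integral expressions of Theorem 2, and one predictive step can be simulated. The sketch below is illustrative (it assumes NumPy, and the values of *n*, *m*, *σ*, and Ψ<sub>1</sub> are hypothetical):

```python
import numpy as np
from math import gamma

def beta_fn(x, y):
    """Euler Beta function B(x, y) = Gamma(x)Gamma(y)/Gamma(x + y)."""
    return gamma(x) * gamma(y) / gamma(x + y)

def p_star_numeric(m_i, n, sigma, psi1, grid=200_000):
    """p_i^*(Psi_1) of Theorem 2 by midpoint quadrature,
    with lambda(s) = sigma * s^(-1 - sigma) (stable SP)."""
    s = (np.arange(grid) + 0.5) / grid
    lam = sigma * (s * psi1) ** (-1.0 - sigma)
    num = np.sum(s ** (m_i + 1) * (1 - s) ** (n - m_i) * lam)
    den = np.sum(s ** m_i * (1 - s) ** (n - m_i) * lam)
    return num / den  # the grid spacing cancels in the ratio

def stable_sp_predictive(n, m, sigma, psi1, rng):
    """One draw from the predictive (22), conditionally on Z_n and Psi_1."""
    # number of new features: Poisson(sigma * Psi_1^{-sigma} * B(1 - sigma, n + 1))
    y_new = rng.poisson(sigma * psi1 ** (-sigma) * beta_fn(1 - sigma, n + 1))
    # old feature W_i^*: Bernoulli((m_i - sigma)/(n + 1 - sigma))
    a_old = [rng.binomial(1, (m_i - sigma) / (n + 1 - sigma)) for m_i in m]
    return y_new, a_old

# illustrative sample: n = 10 observations, k = 3 features with frequencies m
n, m, sigma, psi1 = 10, [3, 1, 7], 0.5, 2.0
rng = np.random.default_rng(0)
y_new, a_old = stable_sp_predictive(n, m, sigma, psi1, rng)
# the integral ratio of Theorem 2 matches the closed form (m_i - sigma)/(n + 1 - sigma)
assert abs(p_star_numeric(3, n, sigma, psi1) - (3 - sigma) / (n + 1 - sigma)) < 1e-3
```

Note how the Bernoulli probabilities in (22) do not involve Ψ<sub>1</sub>: for the stable SP, only the rate of new features depends on the latent largest jump.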

#### **4. Predictive Characterizations for SPs**

In this section, we introduce and discuss Johnson's "sufficientness" postulates in the context of feature-sampling models under the class of SP priors. According to Theorem 1, if the feature-sampling model is a CRM prior, then the conditional distribution of *Yn*+1, given *Zn*, is a Poisson distribution that depends on the sampling information *Z<sup>n</sup>* only through the sample size *n*. Moreover, the conditional probability of generating an old feature *W*∗ *i* given *Z<sup>n</sup>* depends on the sampling information *Z<sup>n</sup>* only through *n* and *mi*. As shown in Theorem 2, SP priors enrich the predictive structure of CRM priors through the conditional distribution of the latent variable Ψ<sup>1</sup> given the observable sample *Zn*. In the next theorem, we characterize the class of SP priors for which the conditional distribution of *Yn*+<sup>1</sup> given *Zn* depends on the sampling information only through *n*.
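The predictive structure of CRM priors recalled above is exemplified by the Indian Buffet Process of Griffiths and Ghahramani [12], which arises from the Beta process with *c* = 1: customer *n* + 1 selects old dish *i* with probability *m<sub>i</sub>*/(*n* + 1) and tries a Poisson(*α*/(*n* + 1)) number of new dishes, so the predictive depends on the data only through *n* and the *m<sub>i</sub>*s. A minimal simulation sketch (NumPy; the mass parameter *α* and the seed are illustrative):

```python
import numpy as np

def ibp_sample(n_customers, alpha, rng):
    """Indian Buffet Process: the predictive scheme of a Beta process CRM prior.
    The probability of an old dish depends on Z_n only through (n, m_i);
    the Poisson rate for new dishes depends on Z_n only through n."""
    counts = []  # m_i: how many customers chose dish i so far
    rows = []    # binary feature row of each customer
    for n in range(n_customers):
        row = [rng.binomial(1, m / (n + 1)) for m in counts]  # old dishes
        new = rng.poisson(alpha / (n + 1))                    # new dishes
        for i, a in enumerate(row):
            counts[i] += a
        counts.extend([1] * new)
        row.extend([1] * new)
        rows.append(row)
    return counts, rows

counts, rows = ibp_sample(50, alpha=3.0, rng=np.random.default_rng(1))
# for the one-parameter IBP, E[K_n] = alpha * (1 + 1/2 + ... + 1/n)
```

The characterization question of this section asks which SP priors retain such a reduced dependence on the sampling information.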

**Theorem 3.** *Let Z*1, *Z*2,... *be exchangeable random variables modeled as in* (7)*, where* M *equals* SP(*ν*, *f*Ψ<sup>1</sup> ) *and ν*(d*w*, d*s*) = *λ*(*s*)d*sP*(d*w*)*. Moreover, suppose that Z<sup>n</sup> is a random sample that displays Kn* = *k distinct features* {*W*<sup>∗</sup> <sup>1</sup> , ... , *W*<sup>∗</sup> *Kn* }*, and feature W*<sup>∗</sup> *<sup>i</sup> appears exactly Mn*,*<sup>i</sup>* = *mi times in the sample, for <sup>i</sup>* <sup>=</sup> 1, ... , *Kn. If <sup>f</sup>*Ψ<sup>1</sup> : (0,*r*) <sup>→</sup> <sup>R</sup><sup>+</sup> *is a continuous function on the compact support* (0,*r*) *with <sup>r</sup>* <sup>&</sup>gt; <sup>0</sup>*, and the function <sup>λ</sup>* : <sup>R</sup><sup>+</sup> <sup>→</sup> <sup>R</sup><sup>+</sup> *is continuous on its domain, then the conditional distribution of the latent variable* Ψ<sup>1</sup> *given Z<sup>n</sup> depends on the sampling information Z<sup>n</sup> only through n if and only if λ*(*s*) = *Cs*−<sup>1</sup> *on* (0,*r*) *for some constant C* > 0*.*

**Proof.** First of all, if *f*Ψ<sup>1</sup> is defined on the compact support (0,*r*) and if *λ*(*s*) = *Cs*−<sup>1</sup> on (0,*r*) for some constant *C* > 0, then it is easy to see that the posterior distribution of Ψ<sup>1</sup> in (18) depends only on *n* and not on the other sample statistics. We now show the reverse implication. The posterior density of Ψ1, conditionally on *Zn*, satisfies (18), and it is proportional to

$$f\_{\Psi\_1|\mathbf{Z}\_n}(a) \propto \prod\_{i=1}^n e^{-\phi\_i(a)} \prod\_{i=1}^{K\_n} \int\_0^1 s^{m\_i} (1-s)^{n-m\_i} a\lambda(as) \mathrm{d}s \, f\_{\Psi\_1}(a),$$

where $\phi\_i(a) = \int\_0^1 s(1-s)^{i-1} a\lambda(as)\,\mathrm{d}s$. Then, there exists a constant $c(m\_1, \dots, m\_k, k, n)$ such that

$$f\_{\Psi\_1|\mathbf{Z}\_n}(a) = \frac{\prod\_{i=1}^n e^{-\phi\_i(a)} \prod\_{i=1}^{K\_n} \int\_0^1 s^{m\_i} (1-s)^{n-m\_i} a\lambda(as) \,\mathrm{d}s \, f\_{\Psi\_1}(a)}{c(m\_1, \dots, m\_k, k, n)}. \tag{23}$$

Because of the assumptions imposed, the distribution of Ψ1|*Z<sup>n</sup>* does not depend on *Kn*, nor on the corresponding sample frequencies *Mn*,1,..., *Mn*,*Kn* . Accordingly, the function

$$f\_1(a,n) := f\_{\Psi\_1|\mathbf{Z}\_n}^{-1}(a) \prod\_{i=1}^n e^{-\phi\_i(a)} f\_{\Psi\_1}(a), \quad a \in (0,r), \tag{24}$$

depends only on *a* and *n*, but not on *k* and (*m*1, ... , *mk*). Then, putting together (23) and (24), it holds that

$$f\_1(a,n) \cdot \prod\_{i=1}^k \int\_0^1 s^{m\_i} (1-s)^{n-m\_i} a\lambda(as) \mathrm{d}s = c(m\_1, \dots, m\_k, n, k) \quad \forall a \in (0, r), \tag{25}$$

where *c* is the normalizing factor, and it does not depend on the variable *a*. By choosing *<sup>m</sup>*<sup>1</sup> <sup>=</sup> ... <sup>=</sup> *mk* <sup>=</sup> *<sup>n</sup>* <sup>∈</sup> <sup>N</sup>, Equation (25) implies that the function

$$f\_1(a,n)\left(\int\_0^1 s^n a\lambda(as)\,\mathrm{d}s\right)^k \tag{26}$$

is defined for any *a* ∈ (0,*r*) yet does not depend on *a*; it depends only on *k* and *n*. Since this assertion holds for any *k* ≥ 1, one may select *k* = 1, thus obtaining the following identity:

$$f\_1(a,n) = c^\* \left( \int\_0^1 s^n a \lambda(as) \mathrm{d}s \right)^{-1} \tag{27}$$

for some constant *c*∗, independent of *a*, but that may depend on *n*. Substituting (27) into (26), we obtain that

$$
c^\* \left( \int\_0^1 s^n a \lambda(as) \,\mathrm{d}s \right)^{k-1} \tag{28}
$$

is a function that does not depend on *a*, but only on *n* and *k*. As a consequence, we have that

$$\int\_0^1 s^n a \lambda(as) \,\mathrm{d}s = \int\_0^a \frac{s^n}{a^n} \lambda(s) \,\mathrm{d}s = C^{\*\*}$$

for a suitable constant *C*∗∗, which does not depend on *a* ∈ (0,*r*). To conclude, we rewrite the previous identity as $\int\_0^a s^n \lambda(s)\,\mathrm{d}s = C^{\*\*} a^n$ and differentiate with respect to *a*, which shows that

$$a^n \lambda(a) = n a^{n-1} C^{\*\*},$$

namely, *λ*(*a*) = *C*/*a* for *a* ∈ (0,*r*), where *C* is a positive constant. This is a valid Lévy intensity; indeed, it satisfies condition (11). Outside the interval (0,*r*), *λ* may be defined arbitrarily; indeed, the values of *λ* on [*r*, +∞) do not affect the posterior distribution (18) of Ψ<sup>1</sup>.
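The direct implication of Theorem 3 can also be checked symbolically: with *λ*(*s*) = *Cs*⁻¹, the factor *aλ*(*as*) = *C*/*s* no longer involves *a*, so every data-dependent integral in the posterior (18) is constant in *a*. A small SymPy sketch (the concrete values of *n* and *m<sub>i</sub>* are illustrative):

```python
import sympy as sp

a, s, C = sp.symbols('a s C', positive=True)

# with lambda(s) = C/s, the factor a*lambda(a*s) = C/s does not involve a,
# so the exponent integrals reduce to constants: int_0^1 s^n * (C/s) ds = C/n
for n in [1, 2, 5]:
    integral = sp.integrate(s**n * a * (C / (a * s)), (s, 0, 1))
    assert sp.simplify(integral - C / sp.Integer(n)) == 0
    assert a not in integral.free_symbols

# a frequency factor of (18), e.g. m_i = 2, n = 5: also free of a
factor = sp.integrate(s**2 * (1 - s)**3 * a * (C / (a * s)), (s, 0, 1))
assert a not in factor.free_symbols
```

Since every such factor is free of *a*, the posterior of Ψ<sub>1</sub> in (18) is proportional to *f*<sub>Ψ₁</sub> up to constants depending on *n* alone, as claimed.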

**Remark 2.** *Note that in Theorem 3, we have supposed that f*Ψ<sup>1</sup> *has a compact support on* (0,*r*)*; thus, we are interested in defining λ on* (0,*r*) *only; outside this interval, λ can be defined arbitrarily because it does not affect the posterior distribution* (18) *of* Ψ1*. From the proof of Theorem 3, it becomes apparent that if the support of <sup>f</sup>*Ψ<sup>1</sup> *is the entire positive real line* <sup>R</sup>+*, the posterior distribution of the largest jump depends only on n if and only if λ*(*s*) = *Cs*−<sup>1</sup> *on* R<sup>+</sup> *for some constant C* > 0*. However, in this case, λ does not meet the integrability condition* (11)*; hence, this can only be considered a limiting case. It is interesting to observe that such a limiting situation, with the additional assumption f*Ψ<sup>1</sup> = *f*Δ<sup>1</sup> *, corresponds to the Beta process case with σ* = 0 *and c* = 1 *(Griffiths and Ghahramani [12]).*

Now, we characterize SPs for which the posterior distribution of Ψ<sup>1</sup> depends only on *n* and *Kn*, but not on the sample frequencies (*m*1, ... , *mk*) of the distinct features. Here, we assume that *f*Ψ<sup>1</sup> has full support a priori. The following characterization has been provided in Camerlenghi et al. [15] (Theorem 3), but for completeness, we report the proof.

**Theorem 4** (Camerlenghi et al. [15])**.** *Let Z*1, *Z*2, ... *be exchangeable random variables modeled as in* (7)*, where* M *equals* SP(*ν*, *f*Ψ<sup>1</sup> ) *and ν*(d*w*, d*s*) = *λ*(*s*)d*sP*(d*w*)*. Suppose that Z<sup>n</sup> is a random sample that displays Kn* = *k distinct features* {*W*<sup>∗</sup> <sup>1</sup> , ... , *W*<sup>∗</sup> *Kn* }*, and feature W*<sup>∗</sup> *<sup>i</sup> appears exactly Mn*,*<sup>i</sup>* <sup>=</sup> *mi times in the sample, for <sup>i</sup>* <sup>=</sup> 1, ... , *Kn. If <sup>f</sup>*Ψ<sup>1</sup> : <sup>R</sup><sup>+</sup> <sup>→</sup> <sup>R</sup><sup>+</sup> *is a strictly positive and continuously differentiable function on* R<sup>+</sup>*, and λ is continuously differentiable, then the conditional distribution of the latent variable* Ψ1*, given Zn, depends on Z<sup>n</sup> only through n and Kn if and only if <sup>λ</sup>*(*s*) = *Cs*−1−*<sup>σ</sup> on* <sup>R</sup><sup>+</sup> *for some constant C* <sup>&</sup>gt; <sup>0</sup> *and <sup>σ</sup>* <sup>∈</sup> (0, 1)*.*

**Proof.** By arguing as in the proof of Theorem 3, the posterior density of Ψ<sup>1</sup> given *Z<sup>n</sup>* is proportional to

$$\prod\_{i=1}^{n} e^{-\phi\_i(a)} \prod\_{i=1}^{k} \int\_0^1 s^{m\_i} (1-s)^{n-m\_i} a\lambda(as) \,\mathrm{d}s \, f\_{\Psi\_1}(a),$$

where $\phi\_i(a) = \int\_0^1 s(1-s)^{i-1} a\lambda(as)\,\mathrm{d}s$. Then, there exists a constant $c(m\_1, \dots, m\_k, n, k)$ such that

$$f\_{\Psi\_1|\mathbf{Z}\_n}(a) = \frac{\prod\_{i=1}^n e^{-\phi\_i(a)} \prod\_{i=1}^k \int\_0^1 s^{m\_i} (1-s)^{n-m\_i} a\lambda(as) \text{ds } f\_{\Psi\_1}(a)}{c(m\_1, \dots, m\_k, n, k)}.$$

As a consequence,

$$f\_{\Psi\_1|\mathbf{Z}\_n}^{-1}(a)\prod\_{i=1}^n e^{-\phi\_i(a)} \prod\_{i=1}^k \int\_0^1 s^{m\_i}(1-s)^{n-m\_i} a\lambda(as) \,\mathrm{d}s\, f\_{\Psi\_1}(a) = c(m\_1, \dots, m\_k, n, k). \tag{29}$$

If the density function *<sup>f</sup>*Ψ1|*Z<sup>n</sup>* (*a*) does not depend on *<sup>m</sup>*1, ... , *mk*, then the following function

$$f\_{\Psi\_1|\mathbf{Z}\_n}^{-1}(a)\prod\_{i=1}^n e^{-\phi\_i(a)}f\_{\Psi\_1}(a) = f\_1(a,k,n)$$

depends only on *k*, *n* and *a*, but not on the frequency counts. Therefore, (29) boils down to

$$f\_1(a,k,n) \cdot \prod\_{i=1}^k \int\_0^1 s^{m\_i} (1-s)^{n-m\_i} a\lambda(as) \,\mathrm{d}s = c(m\_1, \dots, m\_k, n, k), \tag{30}$$

where the function on the right-hand side of (30) is independent of *a* for any choice of the vector of sampling information (*m*1, ... , *mk*, *n*, *k*). Now, since the vector (*m*1, ... , *mk*, *n*, *k*) can be chosen arbitrarily, we can make the choice *m*<sup>1</sup> = ··· = *mk* = *m* > 0, so that the function

$$\left[ w(a,k,n) \int\_0^1 s^m (1-s)^{n-m} a\lambda(as) ds \right]^k \tag{31}$$

does not depend on *<sup>a</sup>* <sup>∈</sup> <sup>R</sup>+, where $w(a,k,n) = \sqrt[k]{f\_1(a,k,n)}$. Moreover, suppose that *m* = *n*; thus,

$$w(a,k,n)\int\_0^1 s^n a\lambda(as)ds\tag{32}$$

does not depend on *<sup>a</sup>* <sup>∈</sup> <sup>R</sup>+, which implies that

$$w(a,k,n) = c^\* \left( \int\_0^1 s^n a \lambda(as) ds \right)^{-1} \tag{33}$$

for some constant *c*∗ > 0 that does not depend on *a*, but may depend on *k* and *n*. By substituting (33) into (31), we obtain

$$\left[\frac{c^\*}{\int\_0^1 s^n \lambda(as)\,\mathrm{d}s} \int\_0^1 s^m (1-s)^{n-m} \lambda(as)\,\mathrm{d}s \right]^k,$$

which is independent of *<sup>a</sup>* <sup>∈</sup> <sup>R</sup>+. Now, it is possible to choose *<sup>m</sup>* <sup>=</sup> *<sup>n</sup>* <sup>−</sup> 1 in the previous function. Therefore, there exists a constant *c*∗∗ independent of *a* such that the following identity holds:

$$\int\_0^1 s^{n-1} \lambda(as) \,\mathrm{d}s - \int\_0^1 s^n \lambda(as) \,\mathrm{d}s = c^{\*\*} \int\_0^1 s^n \lambda(as) \,\mathrm{d}s.$$

By changing variables and differentiating the previous equation twice with respect to *a*, one obtains

$$
\lambda(a)(1 - n c^{\*\*}) = a \lambda'(a) c^{\*\*},
$$

which is an ordinary differential equation in *λ* that can be solved by separation of variables. In particular, we obtain

$$
\lambda(a) = C a^{(1 - n c^{\*\*}) / c^{\*\*}}, \quad \text{for some } C > 0. \tag{34}
$$
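The separation-of-variables step can be verified symbolically. The sketch below (SymPy) checks that the power law in (34) solves the preceding ordinary differential equation; `c2` stands in for *c*∗∗, and the overall constant *C* is set to 1 without loss of generality:

```python
import sympy as sp

a = sp.symbols('a', positive=True)
n, c2 = sp.symbols('n c2', positive=True)   # c2 plays the role of c**
lam = a ** ((1 - n * c2) / c2)              # candidate solution (34), with C = 1

# residual of the ODE lambda(a)(1 - n c**) = a lambda'(a) c**
residual = sp.simplify(lam * (1 - n * c2) - a * sp.diff(lam, a) * c2)
assert residual == 0
```

The general solution is the same power law multiplied by an arbitrary *C* > 0, since the ODE is linear and homogeneous in *λ*.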

To conclude, observe that the exponent of *a* in (34) should satisfy the integrability condition (11) for homogeneous CRMs. Accordingly, it is easy to see that we must consider

$$\lambda(a) = C \frac{1}{a^{1+\sigma}},$$

where *C* > 0 and *σ* ∈ (0, 1). The reverse implication of the theorem is trivially satisfied; hence, the proof is completed.

We recall from Theorem 2 that the conditional distribution of Ψ<sup>1</sup> given *Z<sup>n</sup>* uniquely determines the amount of sampling information included in the conditional distribution of the number of new features *Yn*+<sup>1</sup> given *Zn*. Such sampling information may range from the whole information, in terms of *n*, *Kn*, and (*Mn*,1, ... , *Mn*,*Kn* ), to the sole information on the sample size *n*. According to Theorem 4, the stable SP prior of Corollary 2 is the sole SP prior for which the conditional distribution of the number of new features *Yn*+<sup>1</sup> given *Z<sup>n</sup>* depends on the sampling information *Z<sup>n</sup>* only through *n* and *Kn*. Moreover, according to Theorem 3, the Beta process prior is the sole SP prior for which the conditional distribution of the number of new features *Yn*+<sup>1</sup> given *Z<sup>n</sup>* depends on the sampling information *Z<sup>n</sup>* only through *n*. In particular, Theorems 3 and 4 show that the Beta process prior and the stable SP prior may be considered, to some extent, the feature-sampling counterparts of the Dirichlet process prior and the Pitman–Yor process prior, respectively.

#### **5. Discussion and Conclusions**

In this paper, we have introduced and discussed Johnson's "sufficientness" postulates in the context of feature-sampling models. "Sufficientness" postulates have been investigated extensively in the context of species-sampling models, providing an effective classification of species-sampling models on the basis of the form of their corresponding predictive distributions. Here, we made a first step towards an analogous classification for feature-sampling models. In particular, we obtained Johnson's "sufficientness" postulates when the class of feature-sampling models is restricted to the class of scaled process priors. However, the results presented in this paper remain preliminary and do not provide a complete answer to the characterization problem within the general class of feature-sampling models; this problem remains open.

Within the feature-sampling setting, the predictive distribution is of the form (8), though for the purpose of providing "sufficientness" postulates, one may focus on feature-sampling models exhibiting a general predictive distribution of the following type:

$$\begin{split} \mathbb{P}((Y\_{n+1}, A\_{n+1,1}^\*, \dots, A\_{n+1,K\_n}^\*) = (y, a\_1, \dots, a\_{K\_n}) \,|\, \mathbf{Z}\_n) \\ = g(y; n, k, m) \prod\_{i=1}^k f\_i(a\_i; n, k, m). \end{split} \tag{35}$$

Note that (35) is a probability distribution, and it must satisfy a consistency condition, as usual. Among all the feature-sampling models whose predictive distribution can be written in the form (35), we are interested in characterizing nonparametric priors such that: (i) The function *g* depends on the sampling information only through *n*, and the function *fi* depends only on (*n*, *mi*); (ii) *g* depends only on (*n*, *k*) and *fi* depends only on (*n*, *mi*); (iii) *g* depends only on (*n*, *k*) and *fi* depends only on (*n*, *k*, *mi*). In our view, these characterizations may provide a complete picture of sufficientness postulates within the feature setting, and they are also fundamental to guiding the selection of the prior distribution. We conjecture that CRMs are the nonparametric priors satisfying the characterization (i), the SP with a stable Lévy measure is an example of prior satisfying (ii), and no examples satisfying (iii) have been considered in the current literature. Results in this direction are in Battiston et al. [31], where the authors characterize an exchangeable feature allocation probability function (Broderick et al. [32]) in product forms; this could be a stimulating point of departure to study the characterization problem depicted above.

**Author Contributions:** Writing–original draft, F.C. and S.F.; writing–review and editing, F.C. and S.F. The authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program under grant agreement No. 817257.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** F.C. is extremely grateful to Eugenio Regazzini for the time spent at the Department of Mathematics of University of Pavia during his Ph.D. studies in Mathematical Statistics; F.C. wants to especially thank Eugenio Regazzini for having introduced him to the study of Bayesian Statistics with a stimulating Ph.D. course held together with Antonio Lijoi. S.F. wishes to express his gratitude to Eugenio Regazzini, whose fundamental contributions to Bayesian statistics have always been a great source of inspiration, transmitting enthusiasm and methods for the development of his own research. The authors gratefully acknowledge the financial support from the Italian Ministry of Education, University, and Research (MIUR), "Dipartimenti di Eccellenza" grant 2018-2022. F.C. is a member of the *Gruppo Nazionale per l'Analisi Matematica, la Probabilità e le loro Applicazioni* (GNAMPA) of the *Istituto Nazionale di Alta Matematica* (INdAM).

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**

