Article

Understanding Hierarchical Processes

1 College of Engineering and Computer Science, VinUniversity, Hanoi 100000, Vietnam
2 Faculty of Data Science and AI, Monash University, Clayton, VIC 3800, Australia
Entropy 2022, 24(12), 1703; https://doi.org/10.3390/e24121703
Submission received: 13 September 2022 / Revised: 16 November 2022 / Accepted: 18 November 2022 / Published: 22 November 2022
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Hierarchical stochastic processes, such as the hierarchical Dirichlet process, hold an important position as a modelling tool in statistical machine learning, and are even used in deep neural networks. They allow, for instance, networks of probability vectors to be used in general statistical modelling, intrinsically supporting information sharing through the network. This paper presents a general theory of hierarchical stochastic processes and illustrates its use on the gamma process and the generalised gamma process. In general, most of the convenient properties of hierarchical Dirichlet processes extend to the broader family. The main construction for this corresponds to estimating the moments of an infinitely divisible distribution based on its cumulants. Various equivalences and relationships can then be applied to networks of hierarchical processes. The examples given demonstrate duplication in non-parametric research, and plots of the Pitman–Yor distribution are presented.

1. Introduction

The hierarchical Pitman–Yor process (HPYP) was first presented as a solution to n-gram language models [1], where it mimics the behaviour of the Kneser–Ney algorithm [2]. It is an extension of the hierarchical Dirichlet process (HDP) [3]. The HPYP has since been used in a wide variety of ways, including for previously state-of-the-art and competitive algorithms for topic models [4] and text compression [5]. The HDP has been used for previously state-of-the-art and competitive algorithms for tweet clustering [6] and document segmentation [7]. Many more novel and creative uses of these processes exist, for instance, hierarchical topic models [8]. More general reviews are given by Teh and Jordan [9] and Jordan [10]. The gamma process can also be used hierarchically [11] and provides an alternative scheme for handling the HDP. The notion of hierarchical models fits in well with the computational approach to statistical modelling adopted in the machine learning community.
However, what exactly is the HPYP? A key concept for understanding the HDP and the HPYP is the notion of a discrete base probability measure. The base measure is a source measure for sampling points of the HDP or HPYP. These are discrete just when they have a countable number of possible points (the set on which the measure is based is countable). When finite, the base probability measure is just a probability distribution, usually represented as a vector. However, in non-parametric modelling, we seek to model structured objects for which the dimension may be unknown ahead of time: the number of clusters for points, the depth of a tree, the number of atoms in a molecule, the number of words in a sentence. Allowing the base measure to be countably infinite is a useful abstraction in this situation. Moreover, being able to generate an infinite discrete base probability measure provides us with the ability to model prior distributions for our structured objects without fixing dimensions ahead of time. The above models for text and clustering give examples.
It is known that the hierarchical Dirichlet process, when applied to a finite discrete base distribution, is just a Dirichlet distribution. Indeed, this property is the axiomatic definition of the process [12]. So, applications of and inference with the HDP are really just using hierarchical Dirichlet distributions, requiring no non-parametric theory to describe, although algorithms may be using non-parametric methods.
So, there is a clear concept of what the HDP model is. What is the corresponding result for the hierarchical Pitman–Yor process? For all the algorithms using the HPYP, it would be nice to know what their actual model is! Teh first referred to the hierarchical version of the PYP as the Pitman–Yor distribution in talks accompanying [1], saying it has “no known analytical form”. Moreover, is there a more general theory of hierarchical processes, and why does this case (the HDP) come out so neatly? These questions for hierarchical processes have been addressed in recent theory [13,14,15].
Note the Bayesian theory of non-hierarchical processes is extensive. A comprehensive analysis of different processes is developed by James [16], in the more general context of the generalised Indian buffet process [17]. The general posterior analysis of their normalised versions, including the DP, is developed by James et al. [18]. A useful review of theory and a slice sampler for the case of the normalised generalised gamma is given in Lomeli et al. [19]. A study of some of the processes considered here can also be found in Zhou and Carin [11], focusing on gamma processes and their relationships.
However, these treatments are grounded in extensive probability theory and assume the reader is already familiar with Poisson point processes, Lévy processes, subordinators and other advanced areas [20,21]. Some of these details are not strictly necessary for the understanding of the basic ideas. This paper presents the relevant background theory in a self-contained way to develop models for hierarchical processes generally based on the theory of subordinators and completely random measures [20,21]. The theory for the most part reinterprets results from the Bayesian non-parametric and statistical communities [18,22,23], though some related ideas can also be found in machine learning [11]. However, the answers to the questions about the nature of the HPYP and general application to hierarchical processes, networks of hierarchical processes and generalised Chinese restaurants are not well-known outside the Bayesian non-parametric community, so we present them here in a unified manner.

2. Background Theory

A formal theory of Poisson point processes (PPP), Lévy processes and completely random measures (CRMs) with treatment of measure theory is needed to rigorously cover this area [20,21]. Here, an informal summary is given, though trying to maintain a degree of precision, for instance keeping adequate rigor in the statement of results.

2.1. Completely Random Measures

A CRM is a discrete measure $\mu(dx)$ on a space $\mathcal{X}$ constructed as

$$\mu(x) = C_0 + \sum_{i=1}^{\infty} \lambda_i \, \delta_{x_i}(x) \qquad (1)$$

where the $x_i \in \mathcal{X}$ are called atoms and are assumed distinct, the $\lambda_i \in \mathbb{R}^+$ are called jumps, and the background constant $C_0$ is zero in our use. This means that $\mu(x_i) = \lambda_i$ when evaluated at atoms, and $\mu(x) = 0$ otherwise. The $(\lambda_i, x_i)$ are mutually independent random variables, and a finite number of the $x_i$ can also be fixed. These conditions ensure the measure is completely random, that is, for $A, B \subseteq \mathcal{X}$, if $A \cap B = \emptyset$ then $\mu(A) \perp \mu(B)$.
Moreover, suppose the class of CRMs where $C_0 = 0$ in Equation (1) can be normalised, so $\mu(\mathcal{X}) = \sum_{i=1}^{\infty} \lambda_i < \infty$. This yields discrete probability distributions on $\mathcal{X}$ represented as $\mu(x)/\mu(\mathcal{X})$. These are referred to as normalised random measures with independent increments (NRMIs) [18], a concept developed by Kingman [24], and they form a general class of discrete probability distributions.
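As a concrete illustration in the finite case (a minimal sketch with illustrative values, not part of the original development): jumps drawn as independent gamma variables at fixed atoms give a CRM of the form of Equation (1), and dividing by the total mass yields an NRMI, which here is simply a Dirichlet-distributed probability vector since normalised independent gammas are Dirichlet.

```python
# A minimal finite-dimensional sketch of Equation (1) and its normalisation:
# independent gamma jumps at I fixed atoms form a CRM, and dividing by the
# total mu(X) gives an NRMI. All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
I, M, beta = 5, 2.0, 1.0              # number of atoms, intensity, rate
atoms = rng.uniform(size=I)           # atom locations x_i on X = [0, 1]
jumps = rng.gamma(shape=M / I, scale=1.0 / beta, size=I)   # jumps lambda_i

mu_total = jumps.sum()                # mu(X) = sum_i lambda_i < infinity
nrmi = jumps / mu_total               # normalised random measure on the atoms
print(np.round(atoms, 3), np.round(nrmi, 3))
```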

2.2. Poisson Point Process

A Poisson point process (PPP) is a stochastic process whose samples represent sets of independent events on a measurable space $\mathcal{X}$. For a sample, the count of events in $A \subseteq \mathcal{X}$ is denoted $N(A) \in \mathbb{N}$. Events are considered to be a countable subset of $\mathcal{X}$; this is only significant if $\mathcal{X}$ is not countable, for instance the real line. The PPP has complete independence, so for $A, B \subseteq \mathcal{X}$, if $A \cap B = \emptyset$ then $N(A) \perp N(B)$ and $N(A \cup B) = N(A) + N(B)$. The sample is specified by a rate $\rho(dx)$, which is any measure on $\mathcal{X}$. In PPP theory, the rate is referred to as a Lévy measure. The PPP has the defining property that $N(A) \sim \mathrm{Poisson}(\rho(A))$, and samples can be generated from this by working with an ever finer partition of the space $\mathcal{X}$.
A special class of PPP can be used as a family of priors for a CRM. Assume a PPP has rate $\rho(d\lambda)\,\mu(dx)$ for $\lambda \in \mathbb{R}^+$ and $x \in \mathcal{X}$. This is called homogeneous because the terms in $\lambda$ and $x$ are independent [18]. In the case considered here, $\mu(dx)$ is a measure on $\mathcal{X}$ called a base measure, and the rate $\rho(d\lambda)$ must satisfy the condition $\int_0^\infty \min(1,\lambda)\,\rho(d\lambda) < \infty$ to make everything work neatly [20], as follows. This condition is equivalent to $\int_0^\infty \min(\epsilon,\lambda)\,\rho(d\lambda) < \infty$ for any $0 < \epsilon < \infty$. As a consequence, $\rho([\epsilon,\infty))$ is bounded, meaning there will be a finite number of points with $\lambda > \epsilon$ in the sample of the PPP (within a finitely measured subset of $\mathcal{X}$), and $\int_0^\epsilon \lambda\,\rho(d\lambda)$ is bounded, meaning the sum of the $\lambda$'s less than $\epsilon$ in the sample of the PPP (within a finitely measured subset of $\mathcal{X}$) will be finite even if there is an infinite number of them. Then, a sample from the PPP is a countable set of points which can be used to construct a CRM.
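The defining property $N(A) \sim \mathrm{Poisson}(\rho(A))$ is easy to check by simulation. The following sketch (the rate function and test set are illustrative choices) samples an inhomogeneous PPP by thinning a dominating homogeneous one and compares the empirical mean and variance of $N(A)$ with $\rho(A)$.

```python
# Simulation check of the defining property N(A) ~ Poisson(rho(A)): an
# inhomogeneous PPP on [0, 2] with rate rho(dx) = 3 x^2 dx is sampled by
# thinning a dominating homogeneous PPP. Rate and test set are illustrative.
import numpy as np

rng = np.random.default_rng(1)
x_max, rate_max = 2.0, 12.0                  # 3 x^2 <= 12 on [0, 2]

def sample_ppp():
    n = rng.poisson(rate_max * x_max)        # dominating homogeneous PPP
    pts = rng.uniform(0.0, x_max, size=n)
    keep = rng.uniform(0.0, rate_max, size=n) < 3.0 * pts**2
    return pts[keep]                         # thinned, inhomogeneous sample

A = (0.5, 1.5)
counts = []
for _ in range(20000):
    pts = sample_ppp()
    counts.append(np.sum((pts >= A[0]) & (pts <= A[1])))
rho_A = A[1]**3 - A[0]**3                    # integral of 3 x^2 over A = 3.25
print(np.mean(counts), np.var(counts), rho_A)   # mean and variance near rho(A)
```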

2.3. Example Processes

Consider a number of standard PPPs used to construct CRMs [21]: the generalised (three-parameter) beta process [25], the generalised (three-parameter) gamma process [26] and the stable process. These have the forms given in Table 1, where M is a constant background rate. They are given without specifying a base measure on X , which could be given as a final parameter.
The Poisson process and the negative binomial process [11] are also included in Table 1. Both are used in the hierarchical context in Section 4.
The first three processes in Table 1 are widely used in various forms in the non-parametric Bayesian and machine learning communities. From a Bayesian perspective, they are best thought of as improper priors corresponding to the beta, gamma and gamma distributions, respectively. This analysis is presented later in Section 3.4.
NRMIs can be created by normalising CRMs; the resulting normalised discrete set of weights is then used directly as probabilities. So, generating the $\lambda$ according to a generalised (three-parameter) gamma process, $\mathrm{GGP}(M, \alpha, \beta)$, and then normalising yields what is called the normalised generalised gamma process (NGG), constructed analogously to the Dirichlet process, which normalises the gamma process. These represent the main examples of NRMIs. These NRMIs, however, are not paired with base measures when forming a discrete process on $\mathcal{X}$; rather, they need to be paired with base distributions $\Pr(x)$ since only one point is generated per sample. Denote the NGG process as $\mathrm{NGG}(\alpha, \beta, M)$ or $\mathrm{NGG}(\alpha, \beta, M, h(\cdot))$, where $\alpha, \beta, M$ are as described for the GGP, line 3 of Table 1, and $h(\cdot)$ is a base distribution. The DP is effectively the case when $\alpha = 0$.
Traditionally, the parameter vector part of the DP in Equation (1) is called a GEM distribution (specifically, when a size-biased order is used [27]), named after Griffiths, Engen and McCloskey [28]. This can be represented as an infinite vector $\vec{\lambda} = (\lambda_1, \lambda_2, \ldots)$. Correspondingly, there is a two-parameter version of $\vec{\lambda}$ corresponding to the PYP, $\mathrm{GEM}(\alpha, \beta)$, which has discount $0 \le \alpha < 1$ and concentration $\beta > -\alpha$. Then, $\mathrm{GEM}(0, \beta)$ is the original GEM. Including the base distribution $h(\cdot)$ yields $\mathrm{DP}(\beta, h(\cdot))$ and $\mathrm{PYP}(\alpha, \beta, h(\cdot))$.
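For concreteness, $\mathrm{GEM}(\alpha, \beta)$ can be sampled by the standard size-biased (stick-breaking) scheme, with $v_k \sim \mathrm{Beta}(1-\alpha, \beta + k\alpha)$ and $\lambda_k = v_k \prod_{j<k}(1 - v_j)$; $\mathrm{GEM}(0, \beta)$ recovers the DP case. A truncated sketch:

```python
# Size-biased (stick-breaking) sampling of GEM(alpha, beta), the jump
# proportions of a PYP: v_k ~ Beta(1 - alpha, beta + k alpha) and
# lambda_k = v_k * prod_{j<k} (1 - v_j). Truncation level is an assumption.
import numpy as np

def gem(alpha, beta, trunc, rng):
    v = np.array([rng.beta(1.0 - alpha, beta + (k + 1) * alpha)
                  for k in range(trunc)])
    stick = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])  # remaining stick
    return v * stick                                           # lambda_1, ...

rng = np.random.default_rng(2)
lam = gem(alpha=0.5, beta=1.0, trunc=1000, rng=rng)
print(lam[:5], lam.sum())   # the sum approaches 1 as the truncation grows
```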
The Pitman–Yor process itself was developed by Pitman and Yor [28], and a general scheme for developing related models, called Poisson–Kingman models, is given by Pitman [29]. However, as will be shown, the hierarchical PYP is very different from the PYP, so this theory is not entirely relevant for the hierarchical case. Alternatively, in Pitman and Yor [28] ([Proposition 21]), it was shown that a PYP can be developed by marginalising out a parameter of the NGG as follows.
Lemma 1.
(Deriving a PYP from an NGG) Let $\mu(x) \sim \mathrm{NGG}(\alpha, M, h(\cdot))$ for $\alpha, M > 0$ and suppose $M \sim \mathrm{gamma}(\beta/\alpha, 1)$ for $\beta > 0$; then it follows that $\mu(x) \sim \mathrm{PYP}(\alpha, \beta, h(\cdot))$.
The result is presented rather indirectly in Pitman and Yor [28] and has been re-expressed by several authors [23] ([Section 3.1.1]), [30] ([Corollary 1]); it leads to a different class of models from the Poisson–Kingman models, called Poisson-gamma models [23].
Notice the lemma restricts the PYP to the case where the concentration is positive. More generally, PYPs can have concentration $\beta > -\alpha$. When $\beta = 0$ and $\alpha > 0$, the PYP is formed by normalising a positive stable distribution.

3. Defining Processes Axiomatically

This section gathers together some definitions and theory in order to present a general class of processes built on CRMs that can be treated hierarchically analogous to the Dirichlet process.

3.1. Subordinators

A simple useful case of these PPPs has the domain $\mathcal{X}$ being $\mathbb{R}^+$, the positive real line, with the rate constant in $x$, so the rate is $\rho(d\lambda)$ for $\lambda, x \in \mathbb{R}^+$. For this, define a new process for our case $C_0 = 0$ given by the cumulative values,

$$\sigma_t = \mu((0, t]) = \sum_{i=1}^{\infty} \lambda_i \, \delta_{x_i \le t}$$

So, $\sigma_0 = 0$ and $\sigma_t$ increases in steps as each distinct $x_i$ is passed. This $\sigma_t$ corresponds to the class of so-called pure jump driftless subordinators, which are a kind of nondecreasing Lévy process, which in turn are processes with stationary independent increments [20]. The key relationship that underlies the general theory of these processes is that $\sigma_t$ is distributed according to a particular infinitely divisible non-negative distribution, explained in Theorem 1. Examples are given in Table 1. So, for instance, for the generalised gamma process with parameters $(M, \alpha, \beta)$, the total $\sigma_1 = \sum_{i=1}^{\infty} \lambda_i \delta_{x_i \le 1}$ is distributed as a Tweedie distribution with parameters $(\alpha, M^{1/\alpha}, \beta)$.
The basic connection is given as follows, a special case of the Lévy–Khintchine formula for subordinators. This uses the Laplace exponent of a 1D random variable $y$, defined through the Laplace transform (as a function of $u$) $\mathbb{E}[e^{-uy}]$, which is related to the characteristic function.
Theorem 1.
Consider $\sigma_t$ defined as previously by a PPP with rate $\rho(d\lambda)$ for $\lambda, x \in \mathbb{R}^+$, with $\rho(d\lambda)$ satisfying $\int_0^\infty \min(1,\lambda)\,\rho(d\lambda) < \infty$. The Laplace transform of $\sigma_t$ is given by

$$\mathbb{E}\left[e^{-u\sigma_t}\right] = e^{-t\,\psi(u)}$$

where $\psi(u) = \int_{(0,\infty)} (1 - e^{-u\lambda})\,\rho(d\lambda)$ is the Laplace exponent. This form means that $\sigma_t$ has an infinitely divisible non-negative distribution. The $t$ here can be referred to as the parameter for divisibility, occurring in any infinitely divisible distribution.
Thus, given a rate ρ ( d λ ) defining a particular σ t , one can derive its Laplace exponent ψ ( u ) and then infer the distribution on σ t (where analytically possible). Note the scaling term M in Table 1 plays the role of t.
Some instances of this pairing, an infinitely divisible non-negative distribution with a corresponding rate, are given in the last two columns of Table 1. Note that distributions corresponding to the generalised beta process are not well-known. Other distributions that could be included in the table are the inverse beta distribution (the beta distribution is not infinitely divisible but its inverse is), which includes the Pareto and F-distributions, and the generalised inverse gamma distribution [31].
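As a concrete check of Theorem 1 (a sketch with illustrative values): the gamma process has rate $\rho(d\lambda) = M\lambda^{-1}e^{-\beta\lambda}\,d\lambda$ and Laplace exponent $\psi(u) = M\log(1 + u/\beta)$, both used again in Section 4.3, and the latter is exactly the Laplace exponent of a $\mathrm{gamma}(M, \beta)$ total $\sigma_1$. Quadrature of the Lévy–Khintchine integral confirms the closed form.

```python
# Quadrature check of Theorem 1 for the gamma process: with rate
# rho(d lambda) = M lambda^{-1} e^{-beta lambda} d lambda, the
# Levy-Khintchine integral should equal psi(u) = M log(1 + u / beta),
# the Laplace exponent of a gamma(M, beta) total sigma_1.
import numpy as np
from scipy.integrate import quad

M, beta = 2.5, 1.5                            # illustrative values

def psi(u):
    integrand = lambda lam: (1.0 - np.exp(-u * lam)) * M * np.exp(-beta * lam) / lam
    return quad(integrand, 0.0, np.inf)[0]

for u in [0.1, 1.0, 5.0]:
    print(psi(u), M * np.log(1.0 + u / beta))  # the two columns agree
```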

3.2. Axiomatic Definitions

To extend Theorem 1 to broader classes of base distributions on general domains X , not just the positive real line with constant measure used in subordinators, one can give an axiomatic definition of a process based on an infinitely divisible non-negative distribution:
  • The derived process is a CRM,
  • The process behaves like the given infinitely divisible distribution on subsets of X .
Definition 1.
(Axiomatic definition of a CRM process) Consider an infinitely divisible non-negative distribution $G(\mu)$, where $\mu$ is the parameter for divisibility. Further assume its Laplace exponent has zero drift. Given a measurable space $\mathcal{X}$, positive intensity $M$ and measure $h(dx)$ on $\mathcal{X}$, consider a stochastic process denoted $\mathrm{GP}(M, h(\cdot))$ induced by $G(\mu)$ as follows. $X \sim \mathrm{GP}(M, h(\cdot))$ yields a CRM on $\mathcal{X}$ such that
 1. For $A, B \subseteq \mathcal{X}$, if $A \cap B = \emptyset$ then $X(A) \perp X(B)$;
 2. For $A \subseteq \mathcal{X}$, $X(A) \sim G(M\,h(A))$.
The first condition implies that the measures are CRMs as per Equation (1). The second condition implies one can construct the discrete measures iteratively, on an ever finer, nested sequence of partitions using the distribution $G(\cdot)$. Alternatively, one can use the Lévy–Khintchine formula of Theorem 1 to show the existence of a corresponding rate yielding a CRM with rate $M\,h(dx)\,\rho(d\lambda)$, which must then satisfy the conditions.
Note that the Dirichlet process can be defined axiomatically [12], akin to Definition 1 with the Dirichlet distribution used instead of the gamma distribution, and base probability distribution used instead of a base measure. This axiomatic construction generalises for any infinitely divisible non-negative distribution as follows:
Definition 2.
(Axiomatic definition of an NRMI process) Consider an infinitely divisible non-negative distribution $G(\mu)$, where $\mu$ is the parameter for divisibility. Further assume its Laplace exponent has zero drift. Consider as well the distribution on probability vectors induced by generating $K$ values $\zeta_k \sim G(\mu_k)$ and normalising to obtain

$$\left( \frac{\zeta_1}{\sum_{k=1}^{K} \zeta_k}, \ldots, \frac{\zeta_K}{\sum_{k=1}^{K} \zeta_k} \right).$$

Denote this distribution by $NG_K(\vec{\mu})$, where $\vec{\mu}$ is the vector of $K$ values $\mu_k$. Given a measurable space $\mathcal{X}$, positive intensity $M$ and probability distribution $h(dx)$ on $\mathcal{X}$, a process denoted $\mathrm{NGP}(M, h(\cdot))$, developed from $G(\mu)$, is defined as follows. It is a stochastic process whose sample is a probability measure on $\mathcal{X}$ such that if $C \sim \mathrm{NGP}(M, h(\cdot))$ then for any finite partition $A_1, \ldots, A_K$ of $\mathcal{X}$, and count $N > 0$, $(C(A_1), \ldots, C(A_K)) \sim \mathrm{multinomial}\big(N, NG_K(M h(A_1), \ldots, M h(A_K))\big)$.
In this way, a multinomial process can be defined axiomatically, as done by Zhou and Carin [11] ([Corollary IV2]). One uses $\mathrm{MP}(N, h(\cdot))$ where $N \in \mathbb{N}^+$ is the total count and $h(\cdot)$ a probability measure. The axiomatic part is $(X(A_1), \ldots, X(A_K)) \sim \mathrm{multinomial}(N, (h(A_1), \ldots, h(A_K)))$. Similarly, a Dirichlet compound multinomial (DCM) process can be defined, denoted as $\mathrm{DCMP}(N, h(\cdot))$, where the axiomatic part is $(X(A_1), \ldots, X(A_K)) \sim \mathrm{DCM}(N, (h(A_1), \ldots, h(A_K)))$. These correspond to a PPP and an NBP, respectively, both given in Table 1, where one has also conditioned on the total count being $N$.

3.3. On the Tweedie Distribution

From Table 1, the marginal distribution for the generalised gamma process is the Tweedie distribution [32] with exponent $\alpha$, sometimes expressed via the index $p = 1 + \frac{1}{1-\alpha}$, which necessarily has $p > 2$. For $\alpha = 0$, the Tweedie distribution becomes a gamma distribution.
The Tweedie distribution with exponent $0 < \alpha < 1$ is formed from a positive stable distribution, defined in terms of the stable distribution with characteristic exponent $\alpha$, scale parameter $s = M^{1/\alpha}$, location zero and symmetry one [33]. This distribution, denoted as $\mathrm{pstable}(\alpha, s)$, has the functional form (adding a scale to the standard formula of [34]) given by the remarkable formula

$$\Pr(x \,|\, \mathrm{pstable}(\alpha, s)) = \frac{\alpha}{1-\alpha}\, \frac{1}{s\pi}\, (x/s)^{-1/(1-\alpha)} \int_0^{\pi} a_{\alpha}(\nu)\, e^{-(x/s)^{-\alpha/(1-\alpha)}\, a_{\alpha}(\nu)}\, d\nu, \quad \text{where } a_{\alpha}(\nu) = \frac{\sin((1-\alpha)\nu)\, \big(\sin(\alpha\nu)\big)^{\alpha/(1-\alpha)}}{\big(\sin(\nu)\big)^{1/(1-\alpha)}},$$
which yields a simple, ingenious sampling formula [34]. To obtain a Tweedie distribution, “exponentially tilt” the $\mathrm{pstable}(\alpha, s)$, calculated by multiplying by $e^{-\beta x}$ and renormalising. The construction of exponentially tilting the distribution (see for instance Pitman [29]) gives the following:

$$\Pr(x \,|\, \mathrm{Twe}(\alpha, s, \beta)) = e^{(s\beta)^{\alpha} - \beta x}\, \Pr(x \,|\, \mathrm{pstable}(\alpha, s)).$$

Here, the factor $e^{(s\beta)^{\alpha}}$ achieves normalisation.
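A sampling sketch follows directly (assuming the reconstructed formulas above): draw $\mathrm{pstable}(\alpha, s)$ using the representation behind the sampling formula of [34], $X = s\,(a_\alpha(\nu)/W)^{(1-\alpha)/\alpha}$ with $\nu \sim \mathrm{Uniform}(0, \pi)$ and $W \sim \mathrm{Exp}(1)$, then obtain $\mathrm{Twe}(\alpha, s, \beta)$ by rejection with acceptance probability $e^{-\beta x}$. The overall acceptance rate is $e^{-(s\beta)^\alpha}$, so the sketch is practical only when $(s\beta)^\alpha$ is small.

```python
# Sampling sketch for Section 3.3: draw pstable(alpha, s) via
# X = s * (a_alpha(nu) / W)^((1 - alpha) / alpha), nu ~ U(0, pi), W ~ Exp(1),
# then obtain Twe(alpha, s, beta) by rejection (exponential tilting), with
# acceptance rate e^{-(s beta)^alpha}. Parameter values are illustrative.
import numpy as np

def a_alpha(nu, alpha):
    return (np.sin((1 - alpha) * nu)
            * np.sin(alpha * nu) ** (alpha / (1 - alpha))
            / np.sin(nu) ** (1 / (1 - alpha)))

def pstable(alpha, s, rng):
    nu = rng.uniform(0.0, np.pi)
    w = rng.exponential(1.0)
    return s * (a_alpha(nu, alpha) / w) ** ((1 - alpha) / alpha)

def tweedie(alpha, s, beta, rng):
    while True:                               # tilting by rejection
        x = pstable(alpha, s, rng)
        if rng.uniform() < np.exp(-beta * x):
            return x

rng = np.random.default_rng(3)
draws = np.array([tweedie(0.5, 1.0, 1.0, rng) for _ in range(5000)])
print(draws.mean())   # approx alpha * s^alpha * beta^(alpha - 1) = 0.5 here
```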

3.4. Bayesian Analysis

A complete Bayesian analysis of CRMs and NRMIs has been developed by James [16] and James et al. [18], respectively, in the non-hierarchical context. This models the standard framework in which hierarchical DPs or hierarchical PYPs are used, but also applies to the Indian buffet process [17]. This is informally developed below so that their theoretical results can be subsequently used. By Bayesian analysis, the following is meant: one has an infinitely divisible distribution suitable for use with Theorem 1. One samples a CRM from this with unknown parameters of rates $\lambda$ and atoms $x_i$. Now, hierarchically sample sets of atoms from this CRM using a PPP. Each hierarchical sample from the CRM is a discrete set $A \subseteq \mathcal{X}$, and multiple samples are drawn. Then, the task is to estimate the parameters of the parent CRM.
A CRM is represented in the form $\mu(x) = \sum_{i=1}^{\infty} \lambda_i \delta_{x_i}(x)$ for $x \in \mathcal{X}$, where the $x_i$ are distinct, and is generated according to a homogeneous PPP with rate $\rho(d\lambda)\,\omega(dx)$, where $\rho(d\lambda)$ is a rate satisfying the conditions of Theorem 1. One then takes $J$ samples from this according to a PPP, so $\vec{n}_j \sim \mathrm{PPP}(\mu(\cdot))$ for $j = 1, \ldots, J$. Each sample will be a finite subset of the atoms, some possibly occurring multiple times. For representational purposes, post hoc reorder the atoms of $\mu(x)$ so that only the first $I$ have non-zero counts. So, for $I < i$, none of the samples $\vec{n}_j$ contain $x_i$. The count of atom $x_i$ in sample $j$ is represented as $n_{j,i}$, so the condition $n_{1:J,i} \ne 0$ means that at least one of the $J$ samples contains the atom $x_i$.
The following informal analysis is offered as an explanation, but formal proofs are in James [16]. To make the analysis feasible, we have to convert the rate $\rho(d\lambda)$ to one with finite total measure. James [16] ingeniously and elegantly presents this by viewing the posterior for $\mu_i$ after seeing the evidence of having at least one non-zero value in the $J$ values, so $n_{1:J,i} \ne 0$. For the particular sampling distribution of $n_{j,i}$, in our case a $\mathrm{Poisson}(\lambda_i)$,

$$\Pr(n_{1:J,i} \ne 0 \,|\, \lambda_i) = 1 - e^{-J\lambda_i}$$

which has a term in $\lambda_i$, so the posterior rate $\Pr(n_{1:J,i} \ne 0 \,|\, \lambda_i)\,\rho(d\lambda_i)$ obtains finite total measure. Denote this total by $\Psi_J = \int \Pr(n_{1:J,i} \ne 0 \,|\, \lambda_i)\,\rho(d\lambda_i)$. Then, working entirely with finite PPPs, one can compute the marginal. First, we generate the number of non-zero atoms $I$ (for the given sample count $J$) by a Poisson, and then generate the vector of counts for each atom $n_{1:J,i}$, like so

$$\Pr(\vec{n}_1, \ldots, \vec{n}_J \,|\, \rho(d\lambda), \mathrm{PPP}) = e^{-\Psi_J}\frac{\Psi_J^I}{I!} \prod_{i=1}^{I} \Pr(n_{1:J,i} \,|\, n_{1:J,i} \ne 0) = e^{-\Psi_J}\frac{\Psi_J^I}{I!} \prod_{i=1}^{I} \frac{\int \Pr(n_{1:J,i} \,|\, \lambda)\,\rho(d\lambda)}{\int \Pr(n_{1:J,i} \ne 0 \,|\, \lambda)\,\rho(d\lambda)} = e^{-\Psi_J}\frac{1}{I!} \prod_{i=1}^{I} \int \Pr(n_{1:J,i} \,|\, \lambda)\,\rho(d\lambda),$$
where the term I ! can be removed if one considers that the atoms are ordered. With similar reasoning, one obtains:
  • the posterior rate of $\lambda_i$: for $i \le I$, has rate $\Pr(n_{1:J,i} \,|\, \lambda_i)\,\rho(d\lambda_i)$;
  • the posterior rate of the remainder CRM $\mu_R(x) = \sum_{i=I+1}^{\infty} \lambda_i \delta_{x_i}(x)$: has rate $\Pr(n_{1:J} = 0 \,|\, \lambda)\,\rho(d\lambda)\,\omega(dx)$;
  • the total rate of the remainder CRM $T_R = \sum_{i=I+1}^{\infty} \lambda_i$: as given by Theorem 1.
The key formula for this kind of analysis is given in our context in Table 2.
The first line, the beP-BP case, is the three-parameter beta process with Bernoulli data, which is the three-parameter Indian buffet process. The second line is the gamma process with Poisson data. Note the data marginals $\int \Pr(n_{1:J,i} \,|\, \lambda)\,\rho(d\lambda)$ in our context can be obtained more directly, as developed in Section 4.2, so formulas are not given.

4. Using Discrete Base Distributions

It is important to understand what happens when you use a discrete distribution as a base distribution to a CRM, since this is what happens when hierarchical constructions of these processes are made. Let the base measure on $\mathcal{X}$ have the form $\mu(x) = \sum_{i=1}^{\infty} \lambda_i \delta_{x_i}(x)$, and let the CRM be constructed using a homogeneous PPP with rate $\rho(d\lambda)\,\mu(dx)$. What happens? This section considers various implications of this. Note a different but more extensive treatment of this scenario for the results on moments, Section 4.2, and the generalised Chinese restaurant process, Section 4.4, is given by Camerlenghi et al. [14] and Argiento et al. [15]. They also include example MCMC sampling algorithms.

4.1. General Results

Superposition of PPPs allows a discrete CRM to be decomposed into a union of trivial PPPs, each with rate of the form $\mu_i\,\rho(\lambda)\,\delta_{x_i}$, so the $\mathcal{X}$ component is a delta function. The resultant CRM is also trivial and takes the form, using Definition 1, $\Lambda\,\delta_{x_i}$, where $\Lambda$ is the total of the $\lambda_k$ generated using the rate $\mu_i\,\rho(\lambda)$. This total is distributed as the corresponding marginal distribution for the subordinator with intensity parameter $\mu_i$, as per Theorem 1.
Lemma 2.
(CRM when base measure is discrete) Let a discrete measure on $\mathcal{X}$ have the form $\mu(x) = \sum_{i=1}^{\infty} \mu_i \delta_{x_i}(x)$ for $x \in \mathcal{X}$, where the $x_i$ are distinct, and let a homogeneous CRM be constructed by sampling using a PPP with rate $\rho(d\lambda)\,\mu(dx)$ on $\mathbb{R}^+ \times \mathcal{X}$. Let $\Gamma(t)$ be the marginal total distribution for the corresponding subordinator, where $t$ is the parameter of divisibility. Then, the CRM has the form

$$\gamma(x) = \sum_{i=1}^{\infty} \gamma_i \delta_{x_i}(x)$$

where the random variables $\gamma_i \sim \Gamma(\mu_i)$, and the $x_i$ are inherited from $\mu(\cdot)$.
The CRM μ ( · ) when used as a base distribution for a PPP is mapped element-wise to form a new CRM γ ( · ) . So, no PPP modelling is required if you know the form of the element-wise distribution.
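For the gamma process, for instance, Lemma 2 reduces hierarchical sampling to element-wise gamma draws, $\gamma_i \sim \mathrm{gamma}(M\mu_i, \beta)$, the closed form used later in Section 4.3. A minimal sketch with illustrative values:

```python
# Lemma 2 for the gamma process: with a discrete base measure, the child CRM
# is obtained element-wise, gamma_i ~ gamma(M mu_i, beta), with no
# point-process machinery required. Values are illustrative.
import numpy as np

rng = np.random.default_rng(4)
M, beta = 5.0, 1.0
mu = np.array([0.5, 0.3, 0.2])                      # discrete base weights
child = rng.gamma(shape=M * mu, scale=1.0 / beta)   # one child CRM sample
print(child)

# Marginal check on atom 0: the empirical mean matches M * mu_0 / beta.
samples = rng.gamma(shape=M * mu[0], scale=1.0 / beta, size=100000)
print(samples.mean(), M * mu[0] / beta)
```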
There are a number of very convenient and well-known properties of the Dirichlet that allow it to be used in hierarchical contexts. As it happens, most of these properties also hold for other NRMIs with discrete base measures, and some for CRMs, so these results are developed here. The first property is aggregation: if $(x_1, x_2, x_3) \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, \alpha_3)$, then $(x_1, x_2 + x_3) \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2 + \alpha_3)$, and this applies for a Dirichlet of any dimension. The second property is renormalisation: if $(x_1, x_2, x_3) \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, \alpha_3)$ then $(x_1, x_2)/(x_1 + x_2) \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2)$. Both properties clearly follow from the fact that a Dirichlet is a normalised gamma, and by analogy they hold for NRMIs too.
Definition 3.
(Aggregation property) Consider a process that takes a measure as an input parameter and outputs another measure. The process has the aggregation property if, when $\sum_{i=1}^{\infty} \gamma_i \delta_{x_i}(x)$ is a sample from the process with a discrete input measure $\sum_{i=1}^{\infty} \mu_i \delta_{x_i}(x)$ where the $x_i$ are distinct, then $\sum_{i=3}^{\infty} \gamma_i \delta_{x_i}(x) + (\gamma_1 + \gamma_2)\,\delta_{x_1}(x)$ is a sample from the process with input measure $\sum_{i=3}^{\infty} \mu_i \delta_{x_i}(x) + (\mu_1 + \mu_2)\,\delta_{x_1}(x)$.
The aggregation property can be used to form arbitrary groupings of the dimensions.
Definition 4.
(Renormalisation property) Consider a process that takes a measure as an input parameter and outputs a probability measure. The process has the renormalisation property if, when $\sum_{i=1}^{\infty} \gamma_i \delta_{x_i}(x)$ is a sample from the process with a discrete input measure $\sum_{i=1}^{\infty} \mu_i \delta_{x_i}(x)$ where the $x_i$ are distinct, then $\frac{1}{\sum_{i=2}^{\infty} \gamma_i} \sum_{i=2}^{\infty} \gamma_i \delta_{x_i}(x)$ is a sample from the process with input measure $\sum_{i=2}^{\infty} \mu_i \delta_{x_i}(x)$.
The renormalisation property then yields probability measures on subsets of the discrete domain, so it can be used for incremental sampling.
Lemma 3.
(Aggregation and renormalisation) Consider the context of Lemma 2. The aggregation property holds for all CRMs and NRMIs. In the case of an NRMI, the renormalisation property holds. For the PYP, the aggregation property holds but not the renormalisation property.
The results for the PYP can be developed using Lemma 1. The aggregation and renormalisation properties together mean that efficient size-biased samplers can be developed for NRMIs by sampling one dimension at a time according to a two-dimensional version of the NRMI, which is effectively the stick breaking construction (although, only a few explicit cases of this are known). Alternatively, one can sample the underlying CRM according to its corresponding infinitely divisible distribution.
A third property of the Dirichlet is neutrality, which applies in the context of renormalisation and requires that the part taken away is independent of the remainder: if $(x_1, x_2, x_3) \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, \alpha_3)$, then $(x_1, x_2)/(x_1 + x_2)$ is independent of $x_3$.
Definition 5.
(Neutrality property) Consider a process that outputs a finite discrete probability measure, and without loss of generality let $\sum_{i=1}^{I} \gamma_i \delta_{x_i}(x)$ be a sample from the process where the $x_i$ are distinct. The process is completely neutral if there exist mutually independent non-negative variables $\lambda_1, \ldots, \lambda_I$ such that $(\gamma_1, \ldots, \gamma_I)$ and $\left(\lambda_1,\; \lambda_2(1-\lambda_1),\; \ldots,\; \lambda_I \prod_{i=1}^{I-1}(1-\lambda_i)\right)$ have the same distribution.
It is known that the only distribution on finite probability vectors with complete neutrality is the Dirichlet distribution [35].

4.2. Results on Moments

Moments of CRMs are critical quantities for their posterior analysis [18,36], to be developed in Section 5 and seen in Section 3.4. The generalised version is derived by unfolding the recursion that relates the moments of a distribution to its cumulants. In the context of Lemma 2, where $\gamma_i \sim \Gamma(\mu_i)$, various moments such as $\mathbb{E}[\gamma_i^n \,|\, \mu_i]$ and $\mathbb{E}[\gamma_i^n e^{-U\gamma_i} \,|\, \mu_i]$ can be computed recursively from the moments of the PPP rate $\rho(d\lambda)$ [22] ([Section 1.3]) and its exponentially tilted form. Note these moments compute the marginals one needs for multinomial and Poisson data, respectively, hence their importance.
In the theorem, the notation $\mathcal{P}_n$ is used to represent all possible non-empty partitions of $n$ items, the set $\{1, \ldots, n\}$. As an example, $\mathcal{P}_3$ is the set

$$\Big\{ \{\{1,2,3\}\},\; \{\{1\},\{2,3\}\},\; \{\{1,2\},\{3\}\},\; \{\{1,3\},\{2\}\},\; \{\{1\},\{2\},\{3\}\} \Big\},$$

so it contains the partition $\{\{1\},\{2,3\}\}$ as an element, for instance. Moreover, $\mathcal{P}_K^n \subseteq \mathcal{P}_n$ denotes those partitions whose members number exactly $K$, so $|\mathcal{P}_1^n| = |\mathcal{P}_n^n| = 1$ and $|\mathcal{P}_2^3| = 3$.
The following lemma is a corollary of the major result by Pitman [22], and some related results appear in Camerlenghi et al. [14]; it is proven in Appendix A.
Lemma 4.
(CRM moments when base measure is discrete) Consider the context of Lemma 2. Let $\kappa_n = \int_0^{\infty} \lambda^n \rho(d\lambda)$ be the $n$-th moment of the rate $\rho(\lambda)$, where it exists, for $n \in \mathbb{N}^+$. Let $\psi(t)$ be the Laplace exponent for the rate. Then, the $n$-th cumulant of $\gamma_i$ can be re-expressed as a moment of the original rate $\rho(\lambda)$, and the $n$-th moment of $\gamma_i$ is computed recursively from it:

$$\kappa_n = (-1)^{n+1} \left.\frac{d^n \psi(t)}{dt^n}\right|_{t=0} \qquad (5)$$
$$\mathrm{cumulant}_n(\gamma_i) = \mu_i \kappa_n$$
$$\mathbb{E}[\gamma_i^n \,|\, \mu_i] = \sum_{\Pi \in \mathcal{P}_n} \mu_i^{|\Pi|} \prod_{C \in \Pi} \kappa_{|C|} \qquad (6)$$
$$\phantom{\mathbb{E}[\gamma_i^n \,|\, \mu_i]} = \sum_{K=1}^{n} \mu_i^K\, T_K^n \qquad (7)$$
$$\text{where}\quad T_K^n = \sum_{\Pi \in \mathcal{P}_K^n} \prod_{C \in \Pi} \kappa_{|C|} = \sum_{k=1}^{n-K+1} T_{K-1}^{n-k} \binom{n-1}{k-1} \kappa_k. \qquad (8)$$

Note the recursion for $T_K^n$ starts at $T_1^n = \kappa_n$, derived from the non-recursive form.
Thus, if the Laplace exponent is known, one can usually compute the moments of the process and hence the cumulants and evidence terms for its corresponding infinitely divisible distribution. When one has Poisson data, the required moments need to include an exponential term, as proven in Appendix B.
Corollary 1.
(Adding an exponential term) Consider the context of Lemma 4 with rate $\rho(\lambda)$. To obtain exponentiated moments of the form $\mathbb{E}[\gamma_i^n e^{-U\gamma_i} \,|\, \mu_i]$, complete the following steps.
 1. Use the rate $e^{-U\lambda}\rho(\lambda)$, for which the Laplace exponent is $\psi(U+t) - \psi(U)$, so the corresponding moments are given by
$$\kappa_{n,U} = (-1)^{n+1} \left.\frac{d^n \psi(t)}{dt^n}\right|_{t=U}$$
 2. Obtain the corresponding $T_K^n$ using Equation (8) with the $\kappa_{n,U}$, denoted $T_{K,U}^n$.
 3. Consequently,
$$\mathbb{E}[\gamma_i^n e^{-U\gamma_i} \,|\, \mu_i] = e^{-\mu_i \psi(U)} \sum_{K=1}^{n} \mu_i^K\, T_{K,U}^n.$$
The components from Lemma 4 for the processes in Table 1 are given in Table 3. These appear in various places in the broader statistical literature. The Laplace exponent is usually computed using integration by parts. The form $S_{K,\alpha}^n$ is the second-order generalised Stirling number used in PYP inference [1,37], a generalised Stirling number of type $(-1, -\alpha, 0)$ [38]. It can be verified using its recursion [37] with Equation (8).
Note the general beta process has no simple analytic form for either $\psi(t)$ or its marginal distribution. Fortunately, it is difficult to envisage a situation where it would be used hierarchically.
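As a sketch of Lemma 4 in a case where the answer is known: for the gamma process (used in Section 4.3 below), $\kappa_n = M\Gamma(n)/\beta^n$, and the moments of $\gamma_i \sim \mathrm{gamma}(M\mu_i, \beta)$ have the closed form $\Gamma(M\mu_i + n)/(\beta^n\,\Gamma(M\mu_i))$, so the recursion of Equation (8) can be checked directly.

```python
# Check of the moment recursion in Lemma 4 / Equation (8) for the gamma
# process, where kappa_n = M Gamma(n) / beta^n and the moments of
# gamma_i ~ gamma(M mu, beta) are Gamma(M mu + n) / (beta^n Gamma(M mu)).
from math import comb, gamma as gamma_fn

M, beta, mu, n_max = 2.0, 1.5, 0.4, 6           # illustrative values
kappa = [0.0] + [M * gamma_fn(n) / beta**n for n in range(1, n_max + 1)]

# T[n][K] via Equation (8), starting from T_1^n = kappa_n.
T = [[0.0] * (n_max + 1) for _ in range(n_max + 1)]
for n in range(1, n_max + 1):
    T[n][1] = kappa[n]
    for K in range(2, n + 1):
        T[n][K] = sum(T[n - k][K - 1] * comb(n - 1, k - 1) * kappa[k]
                      for k in range(1, n - K + 2))

for n in range(1, n_max + 1):
    moment = sum(mu**K * T[n][K] for K in range(1, n + 1))        # Equation (7)
    exact = gamma_fn(M * mu + n) / (beta**n * gamma_fn(M * mu))   # closed form
    print(n, moment, exact)                                       # they agree
```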

4.3. The Gamma Process

Let us consider the simple example of a gamma process, $\mathrm{GP}(M, \beta)$, and assume the data yields Poisson likelihoods of the form $\prod_{i=1}^{I} \gamma_i^{n_i} e^{-U\gamma_i}$ for dimensions $i = 1, \ldots, I$ in the context of Lemma 2. The marginal likelihood for the data $\{n_i, x_i : i = 1, \ldots, I\}$ is then given by

$$\Pr(\mathrm{data} \,|\, \mu(\cdot)) = \mathbb{E}\big[e^{-U\gamma_R} \,\big|\, \mu_R\big] \prod_{i=1}^{I} \mathbb{E}\big[\gamma_i^{n_i} e^{-U\gamma_i} \,\big|\, \mu_i\big]$$

where the expectation is taken with respect to $\gamma(\cdot) \sim \mathrm{GP}(M, \beta, \mu(\cdot))$, which has $\gamma_i \sim \mathrm{gamma}(M\mu_i, \beta)$. Note, in this case, the exact solution is known since the data marginals of the gamma distribution have a simple closed form,

$$\mathbb{E}\big[\gamma_i^{n_i} e^{-U\gamma_i} \,\big|\, \mu_i\big] = \int_0^{\infty} \gamma^{n_i} e^{-U\gamma}\, \frac{\beta^{M\mu_i}}{\Gamma(M\mu_i)}\, \gamma^{M\mu_i - 1} e^{-\beta\gamma}\, d\gamma = \frac{\Gamma(M\mu_i + n_i)}{(U+\beta)^{M\mu_i + n_i}}\, \frac{\beta^{M\mu_i}}{\Gamma(M\mu_i)} \qquad (9)$$

Consider, however, using Corollary 1. In this case, the moments including $e^{-U\gamma_i}$ are found to be $\kappa_{n,U} = \int_0^{\infty} \gamma^n e^{-U\gamma}\, \rho(d\gamma) = \frac{M\,\Gamma(n)}{(U+\beta)^n}$, and the Laplace exponent can be obtained using integration by parts as $M \log(1 + t/\beta)$. One can confirm that the corresponding index $T_{K,U}^n = \frac{1}{(U+\beta)^n}\, S_K^n\, M^K$, where $S_K^n$ is an unsigned Stirling number of the first kind, an index found in collapsed versions of the CRP. Equation (8) yields the standard recurrence for it. So, by Equation (7), and adding back the term $e^{-\mu_i \psi(U)} = \left(\frac{\beta}{U+\beta}\right)^{M\mu_i}$ as per Corollary 1, obtain for atom index $i$ the moment

$$\Pr\big(\gamma_i^{n_i} e^{-U\gamma_i} \,\big|\, \mu_i\big) = \mathbb{E}\big[\gamma_i^{n_i} e^{-U\gamma_i} \,\big|\, \mu_i\big] = \frac{\beta^{M\mu_i}}{(U+\beta)^{M\mu_i + n_i}} \sum_{K=1}^{n_i} S_K^{n_i} (M\mu_i)^K. \qquad (10)$$
The sum can be converted using a standard identity [37] ([Lemma 16]) to recover Equation (9). The sum in Equation (10) has an interpretation as a form of Chinese restaurant process for the dimension $i$. Each partition of the set $\{1, \ldots, n_i\}$, given by $\Pi_i \in \mathcal{P}_{n_i}$, corresponds to a configuration of the $n_i$ data in $|\Pi_i|$ tables. For any table with participants $C \in \Pi_i$, the probability of the table is $\frac{M\mu_i\, \Gamma(|C|)}{(U+\beta)^{|C|}}$. The probability of the configuration $\Pi_i$ is then $\prod_{C \in \Pi_i} \frac{M\mu_i\, \Gamma(|C|)}{(U+\beta)^{|C|}}$. So, introducing the partition $\Pi_i$ or its size as an additional variable,
$$\Pr\big(\gamma_i^{n_i} e^{-U\gamma_i}, \Pi_i \,\big|\, \mu_i\big) = \frac{\beta^{M\mu_i}}{(U+\beta)^{M\mu_i + n_i}}\, (M\mu_i)^{|\Pi_i|} \prod_{C \in \Pi_i} \Gamma(|C|)$$
$$\Pr\big(\gamma_i^{n_i} e^{-U\gamma_i}, |\Pi_i| = K \,\big|\, \mu_i\big) = \frac{\beta^{M\mu_i}}{(U+\beta)^{M\mu_i + n_i}}\, S_K^{n_i} (M\mu_i)^K.$$

The second form, the probability of all configurations of size $K$ ($= |\Pi_i|$), follows from Equation (8).
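The conversion between Equations (10) and (9) rests on the classical identity $\sum_K S_K^n x^K = x(x+1)\cdots(x+n-1) = \Gamma(x+n)/\Gamma(x)$ for unsigned Stirling numbers of the first kind, with $x = M\mu_i$, which the following sketch verifies numerically.

```python
# Check that the sum in Equation (10) recovers the exact marginal of
# Equation (9): unsigned Stirling numbers of the first kind satisfy
# sum_K S_K^n x^K = x (x + 1) ... (x + n - 1), the rising factorial.
from math import prod

def stirling_first(n_max):
    # S[n][K] via the recurrence S_K^n = S_{K-1}^{n-1} + (n - 1) S_K^{n-1}.
    S = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    S[0][0] = 1
    for n in range(1, n_max + 1):
        for K in range(1, n + 1):
            S[n][K] = S[n - 1][K - 1] + (n - 1) * S[n - 1][K]
    return S

n, x = 6, 0.8                          # x plays the role of M mu_i
S = stirling_first(n)
lhs = sum(S[n][K] * x**K for K in range(1, n + 1))
rhs = prod(x + j for j in range(n))    # rising factorial x^(n)
print(lhs, rhs)                        # identical up to rounding
```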

4.4. General Chinese Restaurant Processes

Motivated by the gamma process example just given, now construct a generalised CRP interpretation of the results in Section 4.2. The marginals have an interpretation as generalised versions of Chinese restaurants, including the more efficient collapsed versions [6], both developed in this section. This is intended to complement the comprehensive Bayesian analysis already developed for the non-hierarchical cases by [16,18].
The significance of the formula in Lemma 4 is that the sum in Equation (6) is over partitions $\Pi$ of the $n$ data points, and $\kappa_{|C|}$ represents the probability of generating a single element $C$ of size $|C|$ (in the partition $\Pi$) according to the rate $\rho(\lambda)$. The sum in Equation (7) is instead over partition sizes $K$, and $T_K^n$ is the probability of generating a partition of $K$ non-empty sets according to the rate $\rho(\lambda)$.
Lemma 5.
(General Chinese restaurant processes for CRMs) Consider the posterior data marginal for $\gamma(\cdot)$, as in Corollary 2, where the data is in the form of a Poisson likelihood with counts $n_i > 0$ at each atom $x_i$:

$$\Pr(\{n_i, x_i : i=1,\ldots,I\} \,|\, \gamma(\cdot), U) = \prod_{i=1}^{I} \gamma_i^{n_i} e^{-U\gamma_i}$$

One can treat $\Pi_i \in \mathcal{P}_{n_i}$ as a latent variable, which represents the seating configuration for instances of the atom. Then, the data marginal using $\Pi_1, \ldots, \Pi_I$ takes the form:

$$\Pr(\{n_i, x_i, \Pi_i : i=1,\ldots,I\} \,|\, \mu(\cdot), U) = e^{-\psi(U)\sum_{i=1}^{\infty}\mu_i} \prod_{i=1}^{I} \mu_i^{|\Pi_i|} \prod_{C \in \Pi_i} \kappa_{|C|,U}. \qquad (11)$$

Moreover, for any $j$ (including $j > I$),

$$\Pr(x_j \,|\, \{n_i, x_i, \Pi_i : i=1,\ldots,I\}, \mu(\cdot)) \;\propto\; \mu_j \kappa_{1,U} + \sum_{C \in \Pi_j} \frac{\kappa_{|C|+1,U}}{\kappa_{|C|,U}}, \qquad (12)$$

where the convention is used that $\Pi_j = \emptyset$ for $j > I$ (when there is no data). Alternatively, if $K_i$, the number of tables for atom index $i$, is handled as a latent variable, then the data marginal given table numbers takes the form:

$$\Pr(\{n_i, x_i, K_i : i=1,\ldots,I\} \,|\, \mu(\cdot), U) = e^{-\psi(U)\sum_{i=1}^{\infty}\mu_i} \prod_{i=1}^{I} \mu_i^{K_i}\, T_{K_i,U}^{n_i}. \qquad (13)$$
Equation (12) is related to the generalised Blackwell–MacQueen sampling scheme of James et al. [18] ([Section 3.3]). The data marginals in Equations (11) and (13) have a simple Poisson likelihood in $\mu$. Thus, a CRP interpretation of a gamma process can be used for hierarchical inference with a gamma distribution, as used by Zhou and Carin [11], for instance.
To develop a corresponding formula for NRMIs where they are generated by normalising a CRM, we use an ingenious technique from [18] for normalising a CRM within a posterior analysis. The basic idea is to convert multinomial sampling into Poisson sampling (without normalisation), but this requires some post manipulation to derive the results. A generative variation of this goes as follows:
  • For each multinomial $\vec{n}$ according to the unnormalised values $\vec{\lambda}$, introduce a scale-free latent relative mass denoted $U$, with the scale-invariant improper prior $\frac{dU}{U}$.
  • Generate the data needed according to a Poisson, $n_i \sim \mathrm{Poisson}(U\lambda_i)$ for $i = 1, \ldots, \infty$, noting that $n_i = 0$ for $i > I$.
  • Then, the joint posterior on $\vec{n}, \vec{\lambda}, U$ becomes quite concentrated for $U$, and $U$ can be marginalised out.
  • To correct the formulas, multiply the marginal by $N = \sum_{i=1}^{I} n_i$ to obtain a conversion to a multinomial.
To see that this indeed does what is required, one needs to verify the following identity.
$$N \int_{\mathbb{R}^+} \prod_{i=1}^{\infty} \frac{e^{-U\lambda_i} (U\lambda_i)^{n_i}}{n_i!}\, \frac{dU}{U} = \binom{N}{\vec{n}} \prod_{i=1}^{I} \left( \frac{\lambda_i}{\sum_i \lambda_i} \right)^{n_i}. \qquad (14)$$
Note the product $\prod_{i=1}^{\infty}$ is well-defined because $\sum_{i=1}^{\infty} \lambda_i$ is finite.
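The identity of Equation (14) is straightforward to confirm numerically for a small example (values below are illustrative):

```python
# Numerical check of the identity in Equation (14): the Poisson trick with
# improper prior dU / U recovers the multinomial likelihood.
import numpy as np
from math import factorial
from scipy.integrate import quad
from scipy.stats import multinomial

lam = np.array([0.7, 1.3, 2.0])
counts = np.array([2, 1, 3])
N = counts.sum()

def integrand(U):
    facts = np.array([factorial(int(k)) for k in counts])
    return np.prod(np.exp(-U * lam) * (U * lam) ** counts / facts) / U

lhs = N * quad(integrand, 0.0, np.inf)[0]
rhs = multinomial.pmf(counts, n=N, p=lam / lam.sum())
print(lhs, rhs)                        # the two sides agree
```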
Corollary 2.
(General Chinese restaurant processes for NRMIs) Consider the posterior data marginal for $\gamma(\cdot)$ as given in Lemma 2, where the data is in the form of a multinomial likelihood with counts $n_i > 0$ at each atom $x_i$:

$$\Pr(\{n_i, x_i : i=1,\ldots,I\} \,|\, \gamma(\cdot)) = \prod_{i=1}^{I} \left( \frac{\gamma_i}{\sum_{i=1}^{\infty} \gamma_i} \right)^{n_i},$$

and let $N = \sum_{i=1}^{I} n_i$ be the total count. Let $U \sim \mathrm{gamma}(N, \sum_{i=1}^{\infty} \gamma_i)$. Then, the data marginal using $\Pi_1, \ldots, \Pi_I$, similarly to Lemma 5, takes the form:

$$\Pr(\{n_i, x_i, \Pi_i : i=1,\ldots,I\}, U \,|\, \mu(\cdot)) = \frac{U^{N-1}}{\Gamma(N)}\, e^{-\psi(U)\sum_{i=1}^{\infty}\mu_i} \prod_{i=1}^{I} \mu_i^{|\Pi_i|} \prod_{C \in \Pi_i} \kappa_{|C|,U}.$$

Moreover, for any $j$ (including $j > I$),

$$\Pr(x_j \,|\, \{n_i, x_i, \Pi_i : i=1,\ldots,I\}, U, \mu(\cdot)) \;\propto\; \mu_j \kappa_{1,U} + \sum_{C \in \Pi_j} \frac{\kappa_{|C|+1,U}}{\kappa_{|C|,U}}.$$

Alternatively, if each $K_i$ is handled as a latent variable, then the data marginal given table numbers takes the form:

$$\Pr(\{n_i, x_i, K_i : i=1,\ldots,I\}, U \,|\, \mu(\cdot)) = \frac{U^{N-1}}{\Gamma(N)}\, e^{-\psi(U)\sum_{i=1}^{\infty}\mu_i} \prod_{i=1}^{I} \mu_i^{K_i}\, T_{K_i,U}^{n_i}.$$
Note, to complete the analysis, one needs to model the unseen parts of the processes. So, while it is assumed $\mu_i$ for $i = 1, \ldots, I$ is being sampled or estimated, of the $\mu_i$ and $\gamma_i$ for $i = I+1, \ldots, \infty$ only a finite number, if any, can be sampled or estimated. Handling these is illustrated in Section 5 using a remainder term $\mu_R = \sum_{j=I+1}^{\infty} \mu_j$.
In general, then, there are two different levels of inference one can use when the marginal does not have a simple closed form and must instead be computed using the latent forms in Lemma 5 or Corollary 2:

Sampling over table configurations:

For the DP, this is exhibited by the standard CRP. One can see from Equations (6) and (12) that to resample which table a point belongs to, one would use the following proportionalities:

$$\Pr(C \,|\, \Pi, \mu_k, \ldots) \;\propto\; \begin{cases} \mu_k \kappa_1 & \textit{start a new table} \\[4pt] \dfrac{\kappa_{|C|+1}}{\kappa_{|C|}} & \textit{add to table } C. \end{cases} \qquad (17)$$

Sampling over table sizes:

For the PYP, this is demonstrated by table indicator sampling methods [6,39] and the “direct” Gibbs sampling of Gasthaus and Teh [5], though the latter was subsequently not used because, in their context, they needed to constantly resample the discount $\alpha$. This is a collapsed sampler that instead samples $K$, the number of tables, using Equation (7):

$$\Pr(K \,|\, \mu_k, \ldots) \;\propto\; \mu_k^K\, T_K^n \qquad (18)$$

This is only efficient when $T_K^n$ can be tabulated. In the general case, this requires $O(n^2 K)$ steps using Equation (8), and $O(nK)$ steps for cases such as the gamma process above, where a simpler double recursion is available for $T_K^n$ since they are generalised second-order Stirling numbers.
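A sketch of one step of this collapsed sampler for the gamma-process case of Section 4.3, where $T_K^n$ reduces to unsigned Stirling numbers of the first kind (all values illustrative):

```python
# One step of the collapsed sampler of Equation (18) for the gamma-process
# case: resample the table count K for an atom with weight mu and count n.
# Here mu^K T_K^n is proportional to (M mu)^K S_K^n, since the common
# (U + beta)^{-n} factor cancels on normalisation.
import numpy as np

def stirling_row(n):
    # Row n of the unsigned Stirling numbers of the first kind.
    S = [0.0] * (n + 1)
    S[0] = 1.0
    for m in range(1, n + 1):
        new = [0.0] * (n + 1)
        for K in range(1, m + 1):
            new[K] = S[K - 1] + (m - 1) * S[K]
        S = new
    return S

rng = np.random.default_rng(5)
M, mu, n = 2.0, 0.3, 10                           # illustrative values
S = stirling_row(n)
weights = np.array([(M * mu) ** K * S[K] for K in range(1, n + 1)])
probs = weights / weights.sum()                   # Pr(K | mu_k, ...)
K = rng.choice(np.arange(1, n + 1), p=probs)
print(np.round(probs, 4), K)
```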

5. Variants of the Generalised Gamma Process

In this section, we develop both the CRM and NRMI variants of the generalised gamma process in the hierarchical context. Using the generalised gamma process in an NRMI yields an NGG or a PYP. When the NGG process and the PYP are supplied discrete base distributions as input, they behave analogously to the Dirichlet distribution, as illustrated with Lemma 3. In this discrete context, refer to the corresponding distributions as the NGG distribution and the Pitman–Yor distribution (PYD). Here analytical forms of the PY distribution are developed.

The Hierarchical Context

Consider an NRMI in the context of the base distribution $\mu(x)$, as before. Suppose multinomial-type data is observed in the form of counts $n_k$ associated with the atoms $x_k$ of $\mu(\cdot)$ for $k = 1, \ldots, K$, with total count $N = \sum_{k=1}^{K} n_k$, where all other counts are zero. The latent relative mass trick of James et al. [18] can be used to include $U$ as a latent variable in the likelihood for the NGG and the PYD. Setting $U = 1$ and dividing by $N$ in this case restores the posterior to the original Poisson version. The likelihood for a PYD also includes $M$ (via Lemma 1). To express this, the remainder terms for both the base distribution and the CRM need to be represented:

$$\lambda_R = \sum_{k=K+1}^{\infty} \lambda_k = \Lambda - \sum_{k=1}^{K} \lambda_k, \qquad \mu_R = \sum_{k=K+1}^{\infty} \mu_k.$$
The joint posterior for the NGG is now
$$\Pr(\{\lambda_k, n_k, x_k : k=1,\ldots,K\}, U, \lambda_R \,|\, \mathrm{GGP}, M, \alpha, \beta, N, \mu(\cdot))$$
$$= \frac{1}{\Gamma(N)}\, e^{-U\Lambda}\, U^{N-1}\, \Pr(\lambda_R \,|\, \mathrm{Twe}(\alpha, (M\mu_R)^{1/\alpha}, 1)) \prod_{k=1}^{K} \lambda_k^{n_k}\, \Pr(\lambda_k \,|\, \mathrm{Twe}(\alpha, (M\mu_k)^{1/\alpha}, 1))$$
$$= \frac{1}{\Gamma(N)}\, e^{-M((1+U)^{\alpha} - 1)}\, U^{N-1}\, \Pr(\lambda_R \,|\, \mathrm{Twe}(\alpha, (M\mu_R)^{1/\alpha}, 1+U)) \prod_{k=1}^{K} \lambda_k^{n_k}\, \Pr(\lambda_k \,|\, \mathrm{Twe}(\alpha, (M\mu_k)^{1/\alpha}, 1+U)),$$
where the second line is obtained by applying the exponential tilting formula. Note, Lemma 2 means element-wise application of a distribution to the parameter vector μ inside μ ( · ) . Forms for the PYD are obtained by adding the prior for M. For the normalised stable process, denoted NSP, one obtains
$$\Pr(\{\lambda_k, n_k, x_k : k=1,\ldots,K\}, U, \lambda_R \,|\, \mathrm{NSP}, M, \alpha, N, \mu(\cdot))$$
$$= \frac{1}{\Gamma(N)}\, e^{-U\Lambda}\, U^{N-1}\, \Pr(\lambda_R \,|\, \mathrm{pstable}(\alpha, (M\mu_R)^{1/\alpha})) \prod_{k=1}^{K} \lambda_k^{n_k}\, \Pr(\lambda_k \,|\, \mathrm{pstable}(\alpha, (M\mu_k)^{1/\alpha}))$$
$$= \frac{1}{\Gamma(N)}\, e^{-M U^{\alpha}}\, U^{N-1}\, \Pr(\lambda_R \,|\, \mathrm{Twe}(\alpha, (M\mu_R)^{1/\alpha}, U)) \prod_{k=1}^{K} \lambda_k^{n_k}\, \Pr(\lambda_k \,|\, \mathrm{Twe}(\alpha, (M\mu_k)^{1/\alpha}, U)).$$
From this, one can derive an integral formula for the PYD. Details are in Appendix C, and the result is original.
Lemma 6.
(Integral formula for the PY distribution) Let $\vec{\mu}$ be a $K$-dimensional non-zero probability vector. Then, consider $\vec{\theta} \sim \mathrm{PYD}(\alpha, \beta, \vec{\mu})$ for $\alpha > 0$ and $\beta \ge 0$. To express the probability of $\vec{\theta}$, introduce corresponding latent variables $\vec{\nu} = (\nu_1, \ldots, \nu_K) \in [0, \pi]^K$:

$$\Pr(\vec{\theta} \,|\, \mathrm{PYD}, \alpha, \beta, \vec{\mu}) = \frac{\alpha^{K-1}\, \Gamma(1+\beta)\, \Gamma\!\left(K + \beta(1-\alpha)/\alpha\right)}{(1-\alpha)^{K-1}\, \pi^K\, \Gamma(1+\beta/\alpha)} \int_{[0,\pi]^K} \frac{\prod_{k=1}^{K} a_{\alpha}(\nu_k) \left(\mu_k/\theta_k\right)^{1/(1-\alpha)}}{\left(\sum_{k=1}^{K} a_{\alpha}(\nu_k) \left(\mu_k/\theta_k\right)^{1/(1-\alpha)} \theta_k\right)^{K + \beta(1-\alpha)/\alpha}}\, d\vec{\nu}. \qquad (21)$$
This can be readily evaluated using numerical integration for small $K$. Plots of the marginal for $\theta_1$ for different parameter settings are given in Figure 1 and Figure 2.
Due to the aggregation property of the PYD, these are representative marginals of the distribution for all dimensions. One can see the distributions becoming increasingly skewed as α increases.
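These marginals can also be cross-checked by Monte Carlo: since a PYP supplied with the discrete base distribution $\vec{\mu}$ yields a PYD, $\vec{\theta}$ can be sampled by drawing truncated $\mathrm{GEM}(\alpha, \beta)$ jumps and assigning each jump to a category according to $\vec{\mu}$. A sketch (the truncation level is an assumption):

```python
# Monte Carlo cross-check of the PYD marginals under stick-breaking
# truncation: a PYP with discrete base distribution mu is a PYD, so theta is
# sampled by drawing GEM(alpha, beta) jumps and assigning each to a category.
import numpy as np

def pyd_sample(alpha, beta, mu, trunc, rng):
    v = np.array([rng.beta(1 - alpha, beta + (k + 1) * alpha)
                  for k in range(trunc)])
    lam = v * np.concatenate([[1.0], np.cumprod(1 - v)[:-1]])  # GEM jumps
    cats = rng.choice(len(mu), size=trunc, p=mu)   # base-distribution draws
    theta = np.bincount(cats, weights=lam, minlength=len(mu))
    return theta / theta.sum()        # renormalise away the truncation error

rng = np.random.default_rng(6)
mu = np.array([0.5, 0.5])
draws = [pyd_sample(0.5, 1.0, mu, 2000, rng)[0] for _ in range(2000)]
hist, _ = np.histogram(draws, bins=10, range=(0.0, 1.0), density=True)
print(np.round(hist, 2))              # approximates the theta_1 marginal
```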

6. Networks of Processes

The next natural question to consider is how the above results apply to networks of processes. Several general schemes have been developed for inference in more general networks [3,6,11,19,39,40]. General networks for HPYPs have been demonstrated to scale [4,6], in contrast to earlier Gibbs schemes [3,40], and arguably the HGP has advantages over the HDP [11]. This section is a review of related material with regards to hierarchical processes.

6.1. Identifiability

One important question is the issue of statistical identifiability, and an underlying issue here is whether the parametric structure admits a unique representation [41]. In our case, some simple classes of non-uniqueness are easily identified and avoided. For instance, in Poisson matrix factorisation, if the matrix entry $x_{i,j} \sim \mathrm{Poisson}\left(\sum_{k=1}^{K} \theta_{i,k}\,\phi_{k,j}\right)$, then one can insist that the scale of one of the matrices $\Theta$ or $\Phi$ (comprising the entries $\theta_{i,k}$ and $\phi_{k,j}$, respectively) needs to be anchored somehow, so that the scale of the Poisson parameter is uniquely determined by just the other one. So, the rows of one of the matrices should be normalised.

6.2. Equivalences

Another issue is that in some cases, networks can be transformed from one case to another. For instance, Zhou and Carin [11] ([Section VB]) show that a Poisson gamma-gamma process construction is equivalent to a HDP construction with an independent Poisson-gamma on the total. Given that there are significant differences between the corresponding algorithms in this case, and there are many more in the literature, what other equivalences are there?
Normalisation is conducted to convert a CRM into an NRMI, and in some cases, independence between the parts yields an equivalence between the CRM form and the NRMI form augmented with a total. This has major implications for networks of such processes, presented in the following subsection, so the results are summarised here.
The first results are on discrete processes and are well-known, some for instance reproduced by Zhou and Carin [11].
Lemma 7.
(Equivalent processes) Let $\vec{\mu}$ be a probability vector (possibly infinite), and $M$ a constant positive background rate. Let $X = \sum_{i=1}^{\infty} x_i$, the sum of entries of the non-negative integer vector $\vec{x}$. The following equivalences between (A) and (B) hold:
  • Conditioning the PPP,
$$\text{(A)}\ \ \vec{x} \sim \mathrm{PP}(M\vec{\mu}) \qquad \Longleftrightarrow \qquad \text{(B)}\ \ X \sim \mathrm{Poisson}(M)\ \text{ and }\ \vec{x} \sim \mathrm{MP}(X, \vec{\mu}).$$
  • NBP as a Poisson-gamma mixture,
$$\text{(A)}\ \ \vec{x} \sim \mathrm{NBP}(M, \rho, \vec{\mu}) \qquad \Longleftrightarrow \qquad \text{(B)}\ \ \vec{x} \sim \mathrm{PP}\left(\mathrm{GP}\left(M, \tfrac{1-\rho}{\rho}, \vec{\mu}\right)\right).$$
  • DCMP, given $X \in \mathbb{N}^+$, as a multinomial-Dirichlet mixture,
$$\text{(A)}\ \ \vec{x} \sim \mathrm{DCMP}(X, M\vec{\mu}) \qquad \Longleftrightarrow \qquad \text{(B)}\ \ \vec{x} \sim \mathrm{MP}(X, \mathrm{DP}(M\vec{\mu})).$$
  • Conditioning the NBP,
$$\text{(A)}\ \ \vec{x} \sim \mathrm{NBP}(M, \rho, \vec{\mu}) \qquad \Longleftrightarrow \qquad \text{(B)}\ \ X \sim \mathrm{NB}(M, \rho)\ \text{ and }\ \vec{x} \sim \mathrm{DCMP}(X, M\vec{\mu}).$$
The conditioned versions of the PPP and NBP are used to decompose a likelihood into a total count and the vector of counts for atoms, given the total. Notice, while the conditioned version of the PPP yields a likelihood where the normalised measure ( μ ) and its total (M) are independent, the same does not hold for the conditioned NBP.
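The Poisson-gamma mixture equivalence is easy to confirm empirically in one dimension, $\mu = 1$ (a sketch; library conventions for the negative binomial “success” probability vary, hence the $1 - \rho$ below):

```python
# Empirical check of the Poisson-gamma mixture equivalence of Lemma 7 in one
# dimension (mu = 1): NB(M, rho) counts match Poisson counts whose rate is
# gamma with shape M and scale rho / (1 - rho). Values are illustrative.
import numpy as np

rng = np.random.default_rng(7)
M, rho, S = 3.0, 0.4, 200000

x_nb = rng.negative_binomial(M, 1.0 - rho, size=S)
lam = rng.gamma(shape=M, scale=rho / (1.0 - rho), size=S)
x_mix = rng.poisson(lam)

print(x_nb.mean(), x_mix.mean())    # both approx M rho / (1 - rho) = 2.0
print(x_nb.var(), x_mix.var())      # both approx M rho / (1 - rho)^2
```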

6.3. Normalisation and Independence

On non-discrete processes, some independences apply.
Lemma 8.
(Normalised processes and independence) Let $\Lambda = \sum_{i=1}^{\infty} \lambda_i$, the sum of entries of the infinite non-negative real vector $\vec{\lambda}$. The following two pairs (A) and (B) are equivalent:
  • For the gamma process:
   (A) $\vec{\lambda} \sim \mathrm{GP}(M, \beta)$;
   (B) $\Lambda \sim \mathrm{gamma}(M, \beta)$ and $\vec{\lambda}/\Lambda \sim \mathrm{GEM}(0, M)$, where $\Lambda$ and $\vec{\lambda}/\Lambda$ are independent.
  • For the generalised gamma process where $0 < \alpha < 1$, marginalising $M$:
   (A) $\vec{\lambda} \sim \mathrm{GGP}(M, \alpha, \beta)$ where $M \sim \mathrm{gamma}(\delta/\alpha, \beta^{\alpha})$;
   (B) $\Lambda \sim \mathrm{gamma}(\delta, \beta)$ and $\vec{\lambda}/\Lambda \sim \mathrm{GEM}(\alpha, \delta)$, where $\Lambda$ and $\vec{\lambda}/\Lambda$ are independent.
Moreover, the gamma process is the only case of such independence possible for pure NRMIs (this excludes the second case, as it is marginalised).
Moreover, the gamma process is the only case of such independence possible for pure NRMIs (this excludes the second case as it is marginalised).
Independence in the PYP case (represented as GEM ( α , δ ) in the lemma) is shown by Pitman and Yor [28] ([Proposition 21]).
That the gamma process is the only independence case for CRMs and their NRMIs is a result by Perman et al. [27] ([Corollary 2.3]). This is equivalent to the neutrality of the Dirichlet distribution, again the only distribution on probability vectors exhibiting neutrality. Neutrality and independence in this case can be shown to be equivalent properties. Independence in both these cases is also a consequence of the fact that so-called size-biased sampling for the cases is independent of the total [27,29]. Independence properties such as in Lemma 8 do not hold generally, as indeed size-biased sampling is not generally independent of the total.
Lemma 9.
(Normalisation of other processes) Let $\Lambda = \sum_{i=1}^{\infty} \lambda_i$, the sum of entries of the infinite vector $\vec{\lambda}$.
  • For the generalised gamma process, if $\vec{\lambda} \sim \mathrm{GGP}(M, \alpha, \beta)$ then $\Lambda \sim \mathrm{Twe}(\alpha, M^{1/\alpha}, \beta)$ and $\vec{\lambda}/\Lambda \sim \mathrm{NGG}(\alpha, M)$.
  • For the stable process, if $\vec{\lambda} \sim \mathrm{staP}(M, \alpha)$ then $\Lambda \sim \mathrm{pstable}(\alpha, M^{1/\alpha})$ and $\vec{\lambda}/\Lambda \sim \mathrm{PYP}(\alpha, 0)$.
$\Lambda$ and $\vec{\lambda}/\Lambda$ are not independent in either case.

6.4. Modelling LDA Using HDP

Consider models for the HDP variant of LDA [3], called HDP-LDA, which has been the subject of extensive research. There is a wide variation in the literature of how these are to be represented by graphical model and for statistical inference. Figure 3 shows two equivalent models for HDP-LDA. Figure 3a gives the original model as formulated by Teh et al. [3], and Figure 3b shows the modification used here. Authors sometimes use a more complicated formulation in terms of the underlying stick-breaking model.
In this problem, there are D documents and N d words in each document for d = 1 , , D , where the words w d , n are modelled with an admixture. The probabilistic specification for the corresponding models are given in Figure 4.
Figure 4a shows the probabilistic specification with full base distributions. While this follows the theory directly, it is a fairly large departure from the original representation of LDA. The reformulation in Figure 4b is a direct analogue of the original representation of LDA with two modifications essential for the treatment of a HDP, discussed below as the root node and the non-root node.
The root node of the DP hierarchy is represented as a GEM, which generates the infinite vector. In practice, this can be represented using size-biased sampling [27] formulations, and in the simplest and popular cases this corresponds to stick-breaking methods [42]. In implementation, however, there is no need for this as posterior formulations for the processes are well understood and require no implicit ordering constraints as in stick-breaking.
Non-root nodes down the hierarchy are represented using their underlying infinitely divisible non-negative distribution, in this case the Dirichlet. Note, however, this extends the standard definition of a Dirichlet as the input parameter is an infinite dimensional vector. In implementation, this is no impediment as only a finite amount of data is ever dealt with, although it does require modelling the current number of non-empty dimensions. This can be readily handled using standard parametric techniques [6] or by using truncation [4].
Note Figure 4a also uses a nested construction [43] with the expression DP ( c α , Dirichlet ( c β β ) ) . Here a distribution, in this case a Dirichlet, but it could also be a GP, a DP or any other process, is used as the base distribution. This nesting construction is exactly what is needed to model matrix and tensor factorisation using hierarchical processes.
The nested, hierarchical equivalent to Figure 5b is as follows:
$$\vec{\beta} \sim \mathrm{GEM}(d_\beta, c_\beta)$$
$$G_0 \sim \mathrm{GP}\big(c_\alpha, 1, \mathrm{PYD}(d_\phi, c_\phi, \vec{\beta})\big)$$
$$G_d \sim \mathrm{GP}(c_\theta, s_\theta, G_0)$$
$$\phi_{d,n} \sim G_d \qquad n_d \sim \mathrm{Poisson}(\phi_d)$$
The background word probabilities β are generated, then used as the base distribution for a PYD which then creates variants ϕ k as each atom of the gamma process G 0 . The mixture weights of G 0 correspond to α from Figure 5b. Variants of this, G d , are then created which modify the mixture weights α but leave the atoms constant. So, G d is a weighted sum of the original ϕ k , as is the case in Figure 5b. This is very elegant, but Figure 5b better exposes the detail needed for implementation.
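For implementation, a truncated generative sketch of HDP-LDA along the lines of Figure 4b can be written directly (the truncation level and all hyperparameter values here are illustrative assumptions, not the paper's settings):

```python
# A truncated generative sketch in the spirit of Figure 4b: the root node is
# a GEM sampled by stick-breaking, and each non-root document node is a
# Dirichlet with the truncated infinite vector as its parameter.
import numpy as np

rng = np.random.default_rng(8)
T, D, V, N_d = 20, 5, 100, 50        # topics (truncated), docs, vocab, words

c_alpha, b_theta = 1.0, 10.0
v = rng.beta(1.0, c_alpha, size=T)   # root: alpha ~ GEM(0, c_alpha), truncated
alpha = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
alpha /= alpha.sum()                 # fold leftover stick mass back in

phi = rng.dirichlet(0.1 * np.ones(V), size=T)    # topic-word distributions
docs = []
for d in range(D):
    theta = rng.dirichlet(b_theta * alpha)       # non-root Dirichlet node
    z = rng.choice(T, size=N_d, p=theta)         # topic assignments
    words = np.array([rng.choice(V, p=phi[k]) for k in z])
    docs.append(words)
print(docs[0][:10])
```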

6.5. Example Equivalences with Non-Parametric LDA

Consider extending HDP-LDA to include a Pitman–Yor distribution on the word side. This model, termed NP-LDA by Buntine and Mishra [4], has been demonstrated using a truncated approximation. To bring out equivalences, the multinomial form of the topic model is given, and both are defined in Figure 5.
The gamma scale parameter on $\alpha_0$ is one, as it has an equivalent effect to $c_\theta$, so it needs to be made a constant for identifiability. The equivalence is obtained by noting results from Lemmas 7 and 8; many such results exist for the finite case, for instance by [44]. One can introduce a total rate for documents, $\Theta_d$, and model the count, $N_d$, entirely independently:

$$\alpha_0 \sim \mathrm{gamma}(c_\alpha, 1), \qquad \Theta_d \sim \mathrm{gamma}(c_\theta\, \alpha_0, s_\theta), \qquad N_d \sim \mathrm{Poisson}(\Theta_d).$$
If the concentration parameters are estimated during learning, which is the common case, and recommended for topic models, then equivalence does not hold.
Experimental evidence [4] shows the following:
  • The topic side, θ d , is best not modelled using PYPs because experiments indicate that this gives no performance improvement. The non-Zipfian DPs work best, probably because of the smaller dimensions for number of topics.
  • Modelling the word side, ϕ k , using PYPs systematically outperforms HDP-LDA by a moderate margin in perplexity and yields more explainable topics because the overall “background” words are separately modelled using β .
Several model equivalences hold with regard to these kinds of models.
  • The asymmetric-symmetric version of LDA [45] is a truncated version, not well understood in the community.
  • The asymmetric-asymmetric version of LDA, evaluated by Wallach et al. [45], is a truncated version of the model in Figure 5a.
  • Hierarchical Poisson factorisation [46] (HPF) is a non-parametric formulation of Poisson-gamma matrix factorisation using stick-breaking, and thus is equivalent to HDP-LDA above (when augmented with a gamma model of the total counts).
  • Robust (negative binomial) Poisson factorisation by Zhou et al. [47] is related (ignoring some issue with hyperparameters) to bursty topic models by Doyle and Elkan [48], which has a non-parametric extension in Buntine and Mishra [4].

7. Conclusions

Discrete base distributions make CRMs behave like vectors of infinitely divisible distributions, where application is element-wise without the non-parametrics. So, the gamma process becomes an element-wise gamma distribution, and the generalised gamma process becomes an element-wise Tweedie distribution. This was presented in Lemma 2, Lemma 4 and Corollary 1 and accompanying tables. Similarly, discrete base distributions make NRMIs and related processes behave as normalised versions of the above, sharing some properties of the DP such as renormalisation. So, the HPYP becomes the PY distribution, whose form was developed in Section 5.
If closed forms for analysis of the infinitely divisible distributions do not exist, the generalised versions of Chinese restaurant process (CRP) sampling, given in Equation (17), can be used instead, including versions of the more recent, efficient collapsed samplers for CRPs [6,39], given in Equation (18). Similar formulations also appear in [14,15]. Note many of these quantities, for instance in Table 1, can be derived from the Laplace exponent of the CRM, so a convenient form of the distribution is not needed. The CRPs come about when unfolding the recursion that relates the cumulants of a distribution to the moments of the distribution, a simple result in basic statistics. In this way, known CRPs for the gamma process follow a general scheme that also applies for the generalised gamma process, the generalised beta process and others.
While most of these results follow fairly simply from general results in the non-parametric Bayesian community, some have not yet seen use in the Bayesian machine learning community.
As a specific example of hierarchical distributions, it was also shown in Section 5 that the NGG and PY distributions behave like normalised Tweedie variables when the discount α > 0 and concentration β > 0 , and like normalised positive stable variables when the concentration β = 0 . Moments of the Tweedie distribution show how the standard hierarchical likelihood for the HPYP used to date [5,37] can be derived directly from this framework without appealing to non-parametric theory. A novel integral expression for the PY distribution for discount α > 0 and concentration β ≥ 0 was also developed in Equation (21). This answers the question, “what is a hierarchical PYP?”
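The PY distribution can also be inspected empirically through its well-known stick-breaking construction, with breaks V_k ∼ Beta(1 − α, β + kα). The following truncated sketch uses arbitrary parameter values:

```python
import numpy as np

def py_stick_breaking(discount, concentration, truncation, rng):
    """Truncated stick-breaking for PY(discount, concentration) weights:
    V_k ~ Beta(1 - discount, concentration + k * discount), k = 1, 2, ..."""
    ks = np.arange(1, truncation + 1)
    v = rng.beta(1.0 - discount, concentration + ks * discount)
    # w_k = V_k * prod_{j < k} (1 - V_j)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w

rng = np.random.default_rng(3)
w = py_stick_breaking(0.5, 1.0, truncation=1000, rng=rng)
print(w[:5], w.sum())   # power-law decaying weights; the sum is close to one
```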
A rich variety of matrix factorisations and topic models exists; see, for instance, ([11] Table 1) with seven different versions of negative binomial matrix factorisation, while the software used in Buntine and Mishra [4] has seven different non-parametric versions of LDA. This ignores extensions where the problem changes significantly (document segmentation [7], hierarchical topics [8], supervised topic models, etc.), and these extensions no doubt have their own rich variety of versions and equivalences. Moreover, some of the known equivalences between processes, when applied in the hierarchical case, yield relationships between models and algorithms in the machine learning community that deserve further investigation, as discussed in Section 6. This is confounded by the fact that variants are evaluated using significantly different methodologies; compare, for instance, topic modelling evaluation with recommender systems evaluation. It remains an open question what other significant equivalences exist in the literature, and what implications they have for the algorithms one can use.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proof of Lemma 4

Proof. 
The major result is by Pitman [22]. Equation (5) is obtained by differentiating inside the integral of the Laplace exponent. Note that, when they both exist, the cumulants κ n and raw moments c n are related by the following recursive formula:
$c_n = \kappa_n + \sum_{k=1}^{n-1} \binom{n-1}{k-1}\, \kappa_k\, c_{n-k}.$
One can expand this iteratively to remove the recursion on moments. While P n represents the set of all non-empty partitions of n objects, let S n denote the set of all vectors giving the sizes of non-empty partitions of n. So, if $\boldsymbol{m} \in S_n$ then $m_l > 0$ for $l = 1, \ldots, |\boldsymbol{m}|$ and $\sum_{l=1}^{|\boldsymbol{m}|} m_l = n$. One obtains the following:
$\mathrm{moment}_n(\gamma_k) = \sum_{\boldsymbol{m} \in S_n} \prod_{l=1}^{|\boldsymbol{m}|} \mathrm{cumulant}_{m_l}(\gamma_k)\, \binom{n - \sum_{j<l} m_j - 1}{m_l - 1}.$
This is the same form of expression used in defining the generalised Stirling numbers ([37] Lemma 16). The significance is that the sum is over the sizes of the partitions of n, and the product of binomial coefficients counts the number of partitions with those sizes. Thus, this can be re-expressed as Equation (6). The recursion of Equation (8) can be obtained from the original recursion on c n by reformulation. □
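The recursion is straightforward to implement and check numerically. The sketch below verifies it against a gamma(M, β) variable, whose cumulants M (n−1)! β^{−n} and raw moments Γ(M+n)/(Γ(M) β^n) are both available in closed form (matching the GP entry of Table 3); the k = n term of the sum is the leading κ n term of the recursion above:

```python
import math

def moments_from_cumulants(kappa):
    """Raw moments m_1..m_n from cumulants kappa_1..kappa_n via
    m_i = sum_{k=1}^{i} C(i-1, k-1) * kappa_k * m_{i-k}, with m_0 = 1."""
    m = [1.0]
    for i in range(1, len(kappa) + 1):
        m.append(sum(math.comb(i - 1, k - 1) * kappa[k - 1] * m[i - k]
                     for k in range(1, i + 1)))
    return m[1:]

M, beta = 2.5, 1.5   # arbitrary illustrative values
kappa = [M * math.factorial(n - 1) / beta**n for n in range(1, 6)]
exact = [math.gamma(M + n) / (math.gamma(M) * beta**n) for n in range(1, 6)]
print(moments_from_cumulants(kappa))
print(exact)          # the two lists agree
```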

Appendix B. Proof of Corollary 1

Proof. 
This is based on the following result: suppose a Poisson process has rate $\mu_k\,\rho(\lambda)$, and the total $T = \sum_i \lambda_i$ of a sample has distribution $\Pr(T \mid \mu_k)$. Then, given the exponentially tilted rate $e^{-U\lambda}\,\mu_k\,\rho(\lambda)$, the distribution of the total becomes $e^{\mu_k \psi(U) - U T}\, \Pr(T \mid \mu_k)$, where $\psi(\cdot)$ is the Laplace exponent of $\rho(\lambda)$. The result follows by using the constant $e^{\mu_k \psi(U)}$ to adjust the moments to those desired. □
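For the gamma process this tilting identity can be checked numerically: tilting the rate by e^{−Uλ} turns the total gamma(M, β) into gamma(M, β + U), and the two densities differ exactly by the factor e^{ψ(U) − UT}. A sketch with arbitrary parameter values, folding the mass μ k into M so that ψ(U) = M log(1 + U/β) as in Table 3:

```python
import numpy as np
from scipy.stats import gamma

M, beta, U = 2.0, 1.5, 0.7           # arbitrary illustrative values
t = np.linspace(0.1, 10.0, 50)

# Laplace exponent of the gamma process with mass M and rate beta.
psi = M * np.log(1.0 + U / beta)

# Tilted total: gamma(M, beta + U).  The corollary says this equals
# exp(psi - U*T) times the untilted gamma(M, beta) density.
lhs = gamma.pdf(t, a=M, scale=1.0 / (beta + U))
rhs = np.exp(psi - U * t) * gamma.pdf(t, a=M, scale=1.0 / beta)
print(np.allclose(lhs, rhs))          # True
```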

Appendix C. Proof of Lemma 6

Proof. 
Start with Equation (19) with no data, so $n_k = 0$ and U can be dropped. Substituting terms, and letting $\lambda_R$ be $\lambda_0$:
$= e^{M}\, M^{(K+1)/(1-\alpha)}\, e^{-\sum_{k=0}^{K} \lambda_k} \left(\frac{\alpha}{(1-\alpha)\pi}\right)^{K+1} \prod_{k=0}^{K} a(\nu_k)\, \lambda_k^{-1/(1-\alpha)}\, \mu_k^{1/(1-\alpha)}\, \exp\!\left(-\lambda_k^{-\alpha/(1-\alpha)}\, M^{1/(1-\alpha)}\, \mu_k^{1/(1-\alpha)}\, a(\nu_k)\right).$
Note that, conditionally, $M^{1/(1-\alpha)}$ has a gamma distribution. Conditionally, the variables $\lambda_k$ are log-concave and vanish at the limits of $(0, \infty)$. Moreover, under the transformation $\lambda_k' = 1/(1+\lambda_k)$, the transformed Hessian is non-negative only when the derivative is positive, so the function of $\lambda_k' \in [0, 1]$ is unimodal and suitable for slice sampling. It can also be shown that, conditionally, the auxiliary variables $\nu_k$ are unimodal and bounded, so they are readily sampled using efficient slice sampling (as used, for instance, in a related context by Lomeli et al. [19]).
Marginalising out M by adding its prior and then using the change of variables $m = M^{1/(1-\alpha)}$,
$= \left(\frac{\alpha}{(1-\alpha)\pi}\right)^{K+1} \frac{(1-\alpha)\, e^{-\sum_{k=0}^{K}\lambda_k}}{\Gamma(\beta/\alpha)} \left(\prod_{k=0}^{K} a(\nu_k)\, \lambda_k^{-1/(1-\alpha)}\, \mu_k^{1/(1-\alpha)}\right) \frac{\Gamma\!\left(K+1+\beta(1-\alpha)/\alpha\right)}{\left(\sum_{k=0}^{K} \lambda_k^{-\alpha/(1-\alpha)}\, \mu_k^{1/(1-\alpha)}\, a(\nu_k)\right)^{K+1+\beta(1-\alpha)/\alpha}},$
then conduct a change of variables from $(\lambda_0, \lambda_1, \ldots, \lambda_K)$ to $(\Lambda, \theta_1, \ldots, \theta_K)$, where Λ is the sum and $\theta_k = \lambda_k/\Lambda$. The determinant of the Jacobian is $\Lambda^K$. This yields an independent term in Λ of the form $e^{-\Lambda}\Lambda^{\beta-1}$, which integrates to leave $\Gamma(\beta)$. The result is the data likelihood as given in Equation (21), though the dimension has also been changed from K + 1 to K for simplicity. Moreover, $(1-\alpha)\,\Gamma(\beta)/\Gamma(\beta/\alpha)$ has been re-expressed as $\frac{1-\alpha}{\alpha}\,\frac{\Gamma(1+\beta)}{\Gamma(1+\beta/\alpha)}$, so it is well defined when β = 0.
The derivation for the β = 0 case is similar, starting from Equation (20), again with no data and U = 0. Introduce the integral expression for $\mathrm{pstable}(\alpha, s)$, perform a change of variables from $(\lambda_0, \lambda_1, \ldots, \lambda_K)$ to $(\Lambda, \theta_1, \ldots, \theta_K)$, and then marginalise out Λ. At this point, the terms in M cancel. □
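The slice-sampling step used in the proof, on the transformed variable λ′ = 1/(1 + λ) ∈ (0, 1), can be sketched generically. The target below is a stand-in unimodal density (an unnormalised Beta(2, 3)), not the exact conditional of Lemma 6:

```python
import numpy as np

def slice_sample01(log_f, x0, n_samples, rng, w=0.2):
    """Stepping-out slice sampler for an (unnormalised) unimodal
    log-density on the open interval (0, 1)."""
    x, out = x0, []
    for _ in range(n_samples):
        log_y = log_f(x) + np.log(rng.uniform())       # slice level
        lo = x - w * rng.uniform()
        hi = lo + w
        while lo > 0.0 and log_f(lo) > log_y:          # step out left
            lo -= w
        while hi < 1.0 and log_f(hi) > log_y:          # step out right
            hi += w
        lo, hi = max(lo, 1e-12), min(hi, 1.0 - 1e-12)
        while True:                                    # shrink until accepted
            x_new = rng.uniform(lo, hi)
            if log_f(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                lo = x_new
            else:
                hi = x_new
        out.append(x)
    return np.array(out)

log_f = lambda u: np.log(u) + 2.0 * np.log(1.0 - u)    # Beta(2, 3), unnormalised
rng = np.random.default_rng(4)
samples = slice_sample01(log_f, 0.5, 5000, rng)
print(samples.mean())   # approaches the Beta(2, 3) mean of 0.4
```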

References

  1. Teh, Y. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL (ACL ’06), Sydney, Australia, 17–21 July 2006; pp. 985–992.
  2. Kneser, R.; Ney, H. Improved backing-off for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, USA, 9–12 May 1995; Volume 1, pp. 181–184.
  3. Teh, Y.; Jordan, M.; Beal, M.; Blei, D. Hierarchical Dirichlet Processes. J. Am. Stat. Assoc. 2006, 101, 1566–1581.
  4. Buntine, W.; Mishra, S. Experiments with Non-parametric Topic Models. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014.
  5. Gasthaus, J.; Teh, Y. Improvements to the Sequence Memoizer. Adv. Neural Inf. Process. Syst. 2010, 23, 685–693.
  6. Lim, K.; Buntine, W.; Chen, C.; Du, L. Nonparametric Bayesian Topic Modelling with the Hierarchical Pitman-Yor Processes. Int. J. Approx. Reason. 2016, 78, 172–191.
  7. Du, L.; Buntine, W.; Johnson, M. Topic Segmentation with a Structured Topic Model. In Proceedings of NAACL-HLT, Atlanta, GA, USA, 13 June 2013; pp. 190–200.
  8. Paisley, J.; Wang, C.; Blei, D.; Jordan, M. Nested hierarchical Dirichlet processes. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 256–270.
  9. Teh, Y.; Jordan, M. Hierarchical Bayesian Nonparametric Models with Applications. In Bayesian Nonparametrics; Hjort, N., Holmes, C., Müller, P., Walker, S., Eds.; Cambridge University Press: Cambridge, UK, 2010; pp. 158–206.
  10. Jordan, M. Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger; Springer: New York, NY, USA, 2010; pp. 207–218.
  11. Zhou, M.; Carin, L. Negative binomial process count and mixture modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 307–320.
  12. Ferguson, T. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973, 1, 209–230.
  13. Buntine, W. Constructing Poisson Process Models; Unpublished report; Monash University: Clayton, VIC, Australia, 2018.
  14. Camerlenghi, F.; Lijoi, A.; Orbanz, P.; Prünster, I. Distribution theory for hierarchical processes. Ann. Stat. 2019, 47, 67–92.
  15. Argiento, R.; Cremaschi, A.; Vannucci, M. Hierarchical Normalized Completely Random Measures to Cluster Grouped Data. J. Am. Stat. Assoc. 2020, 115, 318–333.
  16. James, L. Bayesian Poisson Calculus for Latent Feature Modeling via Generalized Indian Buffet Process Priors. Ann. Stat. 2016, 45, 2016–2045.
  17. Griffiths, T.; Ghahramani, Z. The Indian Buffet Process: An Introduction and Review. J. Mach. Learn. Res. 2011, 12, 1185–1224.
  18. James, L.; Lijoi, A.; Prünster, I. Posterior analysis for normalized random measures with independent increments. Scand. J. Stat. 2009, 36, 76–97.
  19. Lomeli, M.; Favaro, S.; Teh, Y. A marginal sampler for σ-stable Poisson-Kingman mixture models. J. Comput. Graph. Stat. 2015, 9, 44–53.
  20. Sato, K.I. Basic Results on Lévy Processes. In Lévy Processes: Theory and Applications; Barndorff-Nielsen, O., Mikosch, T., Resnick, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 3–37.
  21. Lijoi, A.; Prünster, I. Models beyond the Dirichlet process. In Bayesian Nonparametrics; Hjort, N., Holmes, C., Müller, P., Walker, S., Eds.; Cambridge University Press: Cambridge, UK, 2010; pp. 80–135.
  22. Pitman, J. Combinatorial Stochastic Processes: École d’Été de Probabilités de Saint-Flour XXXII-2002; Springer: Berlin/Heidelberg, Germany, 2006.
  23. James, L. Stick-breaking PG(α,ζ)-Generalized Gamma Processes. arXiv 2013, arXiv:1308.6570.
  24. Kingman, J. Random Discrete Distributions. J. R. Stat. Soc. Ser. B (Methodological) 1975, 37, 1–15.
  25. Broderick, T.; Jordan, M.; Pitman, J. Beta processes, stick-breaking and power laws. Bayesian Anal. 2012, 7, 439–476.
  26. Brix, A. Generalized Gamma measures and shot-noise Cox processes. Adv. Appl. Probab. 1999, 31, 929–953.
  27. Perman, M.; Pitman, J.; Yor, M. Size-biased Sampling of Poisson Point Processes and Excursions. Probab. Theory Relat. Fields 1992, 92, 21–39.
  28. Pitman, J.; Yor, M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab. 1997, 25, 855–900.
  29. Pitman, J. Poisson-Kingman partitions. In Statistics and Science: A Festschrift for Terry Speed; Goldstein, D., Ed.; Lecture Notes–Monograph Series; Institute of Mathematical Statistics: Beachwood, OH, USA, 2003; Volume 40, pp. 1–34.
  30. Chen, C.; Buntine, W.; Ding, N. Theory of dependent hierarchical normalized random measures. arXiv 2012, arXiv:1205.4159.
  31. Steutel, F.; van Harn, K. Infinite Divisibility of Probability Distributions on the Real Line; Chapman & Hall/CRC Pure and Applied Mathematics; CRC Press: Boca Raton, FL, USA, 2003.
  32. Hofert, M. Sampling Exponentially Tilted Stable Distributions. ACM Trans. Model. Comput. Simul. 2011, 22, 3:1–3:11.
  33. Nolan, J. Maximum likelihood estimation and diagnostics for stable distributions. In Lévy Processes: Theory and Applications; Barndorff-Nielsen, O., Mikosch, T., Resnick, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 379–400.
  34. Chambers, J.; Mallows, C.; Stuck, B. A method for simulating stable random variables. J. Am. Stat. Assoc. 1976, 71, 340–344.
  35. James, I.R.; Mosimann, J. A New Characterization of the Dirichlet Distribution Through Neutrality. Ann. Stat. 1980, 8, 183–189.
  36. James, L.F.; Lijoi, A.; Prünster, I. Conjugacy as a Distinctive Feature of the Dirichlet Process. Scand. J. Stat. 2006, 33, 105–120.
  37. Buntine, W.; Hutter, M. A Bayesian View of the Poisson-Dirichlet Process. arXiv 2012, arXiv:1007.0296v2.
  38. Hsu, L.; Shiue, P.S. A unified approach to generalized Stirling numbers. Adv. Appl. Math. 1998, 20, 366–384.
  39. Chen, C.; Du, L.; Buntine, W. Sampling table configurations for the hierarchical Poisson-Dirichlet process. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD; Springer: Berlin/Heidelberg, Germany, 2011; pp. 296–311.
  40. Wood, F.; Teh, Y.W. A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, FL, USA, 16–18 April 2009.
  41. Cole, D.; Morgan, B.; Titterington, D. Determining the parametric structure of models. Math. Biosci. 2010, 228, 16–30.
  42. Ishwaran, H.; James, L. Gibbs Sampling Methods for Stick-Breaking Priors. J. Am. Stat. Assoc. 2001, 96, 161–173.
  43. Rodriguez, A.; Dunson, D.; Gelfand, A. The nested Dirichlet process. J. Am. Stat. Assoc. 2008, 103, 1131–1154.
  44. Ding, C.; Li, T.; Peng, W. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput. Stat. Data Anal. 2008, 52, 3913–3927.
  45. Wallach, H.; Mimno, D.; McCallum, A. Rethinking LDA: Why priors matter. Adv. Neural Inf. Process. Syst. 2009, 22, 1973–1981.
  46. Gopalan, P.; Ruiz, F.J.R.; Ranganath, R.; Blei, D.M. Bayesian Nonparametric Poisson Factorization for Recommendation Systems. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), Reykjavik, Iceland, 22–25 April 2014.
  47. Zhou, M.; Hannah, L.; Dunson, D.; Carin, L. Beta-negative binomial process and Poisson factor analysis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), La Palma, Canary Islands, 21–23 April 2012.
  48. Doyle, G.; Elkan, C. Accounting for burstiness in topic models. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), Montreal, QC, Canada, 14–18 June 2009; pp. 281–288.
Figure 1. PDFs from Lemma 6 for location μ 1 = 0.1 and fixed α .
Figure 2. PDFs from Lemma 6 for variations with identical location and variance.
Figure 3. Equivalent versions of HDP-LDA. In (a), the arc from β has a modified head to indicate that Dirichlet ( β ) is used in a nested manner.
Figure 4. Equivalent versions of HDP-LDA. Concentration parameters c X treated as constants or estimated. Indices d = 1, …, D, n = 1, …, N d and k = 1, …, ∞. The ϕ are indexed differently in the two versions. The α and θ d are infinite probability vectors in the CRM representation of G 0 and G d , respectively.
Figure 5. NP-LDA and its matrix factorisation counterpart. Concentration parameters c X are constants or estimated, as are discounts d X . Indices d = 1, …, D and k = 1, …, ∞. Vector-wise versions of the gamma and Poisson represent the gamma process and Poisson process, respectively.
Table 1. General processes. Marginal is the corresponding infinitely divisible distribution for the total rate, developed, for instance, using Theorem 1.

| Name | Domain | Parameters | Rate (Lévy measure) | Marginal |
|---|---|---|---|---|
| beP(M, α, β) | 0 < λ < 1 | 0 ≤ α < 1, β > 0 | M λ^{−α−1} (1 − λ)^{α+β−1} / Γ(1−α) | for α = 0, β = 1: Dickman(M) |
| GP(M, β) | λ > 0 | β > 0 | M λ^{−1} e^{−λβ} | gamma(M, β) |
| GGP(M, α, β) | λ > 0 | 0 < α < 1, β > 0 | M α λ^{−α−1} e^{−λβ} / Γ(1−α) | Twe(α, M^{1/α}, β) |
| staP(M, α) | λ > 0 | 0 < α < 1 | M α λ^{−α−1} / Γ(1−α) | pstable(α, M^{1/α}) |
| PP(M) | λ = 1 | — | M | Poisson(M) |
| NBP(M, ρ) | λ ∈ ℕ⁺ | 0 < ρ < 1 | −M ρ^λ / (λ log(1−ρ)) | NB(M, ρ) |
Table 2. Key formulae for posterior analysis of CRMs, where Ψ_J = ∫ Pr(n_{1:J} ≠ 0 | λ) ρ(dλ), and the distribution of the remainder T_R = Σ_{i=I+1}^∞ λ_i.

| Name | Ψ_J | Remainder T_R |
|---|---|---|
| beP(M, α, β)-BP | M Σ_{j=0}^{J−1} Γ(α+β+j) / Γ(1+β+j) | μ_R ∼ beP(M, α, J+β) |
| GP(M, β)-PP | M (log(J+β) − log β) | gamma(M, J+β) |
| GGP(M, α, β)-PP | M ((J+β)^α − β^α) | Twe(α, M^{1/α}, J+β) |
| GP(M, β)-NBP(ρ) | M (log(J log(1/(1−ρ)) + β) − log β) | gamma(M, J log(1/(1−ρ)) + β) |
| GGP(M, α, β)-NBP(ρ) | M ((J log(1/(1−ρ)) + β)^α − β^α) | Twe(α, M^{1/α}, J log(1/(1−ρ)) + β) |
| staP(M, α)-PP | M J^α | Twe(α, M^{1/α}, J) |
| staP(M, α)-NBP(ρ) | M (J log(1/(1−ρ)))^α | Twe(α, M^{1/α}, J log(1/(1−ρ))) |
Table 3. Properties of processes.

| Name | κ_n | ψ(t) | T_K^n |
|---|---|---|---|
| beP(M, α, β) | M (Γ(n−α)/Γ(1−α)) (Γ(α+β)/Γ(n+β)) (for β > 1−α) | (M Γ(α+β)/(α Γ(β))) (₁F₁(1−α, β, t) − 1 + (1/β) ₁F₁(1−α, β+1, t)) | use Equation (8) |
| GP(M, β) | M Γ(n) β^{−n} | M log(1 + t/β) | M^K β^{−n} S^n_K |
| GGP(M, α, β) | M α (Γ(n−α)/Γ(1−α)) β^{α−n} | M ((β+t)^α − β^α) | (M α β^α)^K β^{−n} S^n_{K,α} |
| staP(M, α) | NA | M t^α | NA |