On the Discretization of Continuous Probability Distributions Using a Probabilistic Rounding Mechanism

Tovissodé, Chénangnon Frédéric; Honfo, Sèwanou Hermann; Doumatè, Jonas Têlé; Glèlè Kakaï, Romain

doi:10.3390/math9050555


by Chénangnon Frédéric Tovissodé ¹, Sèwanou Hermann Honfo ¹,†, Jonas Têlé Doumatè ¹,²,† and Romain Glèlè Kakaï ¹,*

¹ Laboratoire de Biomathématiques et d’Estimations Forestières, Université d’Abomey-Calavi, Abomey-Calavi, Benin

² Faculté des Sciences et Techniques, Université d’Abomey-Calavi, Abomey-Calavi, Benin

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2021, 9(5), 555; https://doi.org/10.3390/math9050555

Submission received: 15 January 2021 / Revised: 23 February 2021 / Accepted: 2 March 2021 / Published: 6 March 2021

(This article belongs to the Special Issue Stochastic Processes and Their Applications)


Abstract:

Most existing flexible count distributions allow only approximate inference when used in a regression context. This work proposes a new framework to provide an exact and flexible alternative for modeling and simulating count data with various types of dispersion (equi-, under-, and over-dispersion). The new method, referred to as “balanced discretization”, consists of discretizing continuous probability distributions while preserving expectations. It is easy to generate pseudo random variates from the resulting balanced discrete distribution since it has a simple stochastic representation (probabilistic rounding) in terms of the continuous distribution. For illustrative purposes, we develop the family of balanced discrete gamma distributions that can model equi-, under-, and over-dispersed count data. This family of count distributions is appropriate for building flexible count regression models because the expectation of the distribution has a simple expression in terms of the parameters of the distribution. Using the Jensen–Shannon divergence measure, we show that under the equidispersion restriction, the family of balanced discrete gamma distributions is similar to the Poisson distribution. Based on this, we conjecture that while covering all types of dispersions, a count regression model based on the balanced discrete gamma distribution will allow recovering a near Poisson distribution model fit when the data are Poisson distributed.

Keywords:

flexible count models; balanced gamma distribution; Jensen–Shannon divergence; latent equidispersion

1. Introduction

The regression analysis of count responses mostly relies on the Poisson model. However, the equidispersion (variance equals mean) assumption of the Poisson distribution makes Poisson regression inappropriate in many situations where data show overdispersion (variance greater than mean) or underdispersion (variance less than mean). Moreover, it has been observed that many data sets analyzed using overdispersion models (e.g., negative binomial [1]), which are as popular as the Poisson regression model, may be mixtures of overdispersed and underdispersed or equidispersed counts [2]. The implication is that appropriate alternatives to the Poisson model should allow variable dispersion, i.e., full dispersion flexibility [3]. Existing count regression models with variable dispersion exhibit some drawbacks. The first is improperly normalized probability mass functions in underdispersion situations (quasi-Poisson [4], Consul’s generalized Poisson [5], and extended Poisson–Tweedie regressions [6]), which makes inference approximate with quasi-models. Another drawback is the lack of a simple expression for the model mean value (Conway–Maxwell–Poisson [7], double Poisson [8,9], gamma count [10], semi-nonparametric Poisson polynomial [11], and discrete Weibull [12] models). The latter drawback motivated some research works where quantities other than the mean were modeled, leading to fits that are hard to interpret [11,13].

The development of discrete analogues of continuous probability distributions, which has received great attention in the last two decades, provides an attractive route for building count regression models with variable dispersion. The review in [14] surveys the different methods with their specific application fields. An appealing characteristic of discrete analogues of continuous distributions is that generating discrete pseudo-random values only requires basic operations once the continuous distribution can be simulated. This ease of simulation is especially interesting for simulation-based model evaluation [15] or inference based on parametric bootstrapping [16].

Despite the various existing discretization methods, mean parameterizable approaches necessary to build easily interpretable regression models are rare. For reliability evaluation, the discretizing approach of [17] attempts to match the mean and the variance of the discrete and the related continuous variable, but it provides only an approximate solution at the cost of a tuning parameter. The proposals in [3,18,19] offer solutions for constructing count variables with a fixed mean value and variable dispersion, but they lack a physical basis, i.e., a generating mechanism to motivate their use in practice.

This work proposes a discretization procedure to start from continuous probability distributions and construct count models with (i) properly normalized probability mass functions for underdispersion, equidispersion, as well as overdispersion situations and (ii) simple expressions for the model mean values. As a result, the proposal allows full likelihood inference (as opposed to quasi-likelihood inference) for any dispersion level in observed data and is thus suited for regression analysis where the estimation of covariate effects on the mean count is of great interest.

The proposed discretization approach modifies the “discrete concentration” method, i.e., “Methodology IV” in [14], to preserve the expectation of the continuous distribution. Our proposal, referred to as “balanced discretization”, is based on a probabilistic rounding mechanism, which provides a generating mechanism with a simple interpretation. Interestingly, the probabilistic rounding mechanism, expressed as a simple stochastic representation in terms of the continuous distribution, makes it easy to generate pseudo-random variates from the resulting balanced discrete distribution.

The rest of the paper is organized as follows. Section 2 motivates and presents the balanced discretization method. The general expressions of the distribution functions and moments of balanced discrete distributions are given. The method is applied to the gamma distributions in Section 3 to produce the balanced discrete gamma distribution, which is compared to the discrete concentration of the gamma distribution and to the Poisson distribution. Concluding remarks are given in Section 4.

2. The Balanced Discretization Method

This section motivates and describes the balanced discretization method. The general expressions of the probability mass, the cumulative distribution, the survival and quantile functions, the moments, and the index of dispersion are presented. The link between balanced discretization and the mean-preserving discretization approach of [3] is also established. The proofs of the results are routine and given for completeness in Appendix A.

2.1. Notations

We denote by $Z$ the set of integers ($Z = \{\dots, - 1, 0, 1, \dots\}$), $N$ the set of non-negative integers ($N = \{0, 1, \dots\}$), $N_{+}$ the set of natural numbers ($N_{+} = \{1, 2, \dots\}$), $R$ the set of reals ($R = (- \infty, \infty)$), and $R_{+}$ the set of positive reals ($R_{+} = (0, \infty)$). Let $⌊x⌋$ be the integer part of any real $x$. Following [14], we denote continuous random variables by $X$ and discrete random variables by $Y$. Accordingly, $f_{X} (\cdot | θ)$, $F_{X} (\cdot | θ)$, $S_{X} (\cdot | θ)$, and $Q_{X} (\cdot | θ)$ denote respectively the pdf (probability density function), the cdf (cumulative distribution function), the suf (survival function), and the quf (quantile function) of $X$, whereas $f_{Y} (\cdot | θ)$, $F_{Y} (\cdot | θ)$, $S_{Y} (\cdot | θ)$, and $Q_{Y} (\cdot | θ)$ denote respectively the pmf (probability mass function), the cdf, the suf, and the quf of $Y$, all indexed by a parameter vector $θ$. Continuous probability distributions are assumed to be non-degenerate. A Bernoulli random variable with success probability $p \in [0, 1]$ is denoted $BER (p)$.

2.2. Reminders

First, we recall the discrete concentration method and the mean-preserving approach of [3]. Let $CD (θ)$ be a continuous probability distribution of interest. The discrete concentration $DC (θ)$ of $X \sim CD (θ)$ is the count variable $Y$ with the pmf and suf:

\begin{matrix} f_{Y} (y | θ) & = & F_{X} (y + 1 | θ) - F_{X} (y | θ) \end{matrix}

(1)

\begin{matrix} S_{Y} (y | θ) & = & S_{X} (y | θ) \end{matrix}

(2)

for $y \in Z$, i.e., $Y$ has cdf $F_{Y} (y | θ) = F_{X} (y + 1 | θ)$ and quf $Q_{Y} (u | θ) = ⌊Q_{X} (u | θ)⌋$ for $u \in (0, 1)$. Accordingly, the $n$th moment about zero of $Y$ is:

\begin{matrix} E [Y^{n}] & = & \sum_{y = - \infty}^{\infty} y^{n} [F_{X} (y + 1 | θ) - F_{X} (y | θ)] . \end{matrix}

(3)

Clearly, the discrete concentration of $X$ is simply $Y \overset{d}{=} ⌊X⌋$, where $\overset{d}{=}$ means “equal in distribution to”. Thus, $Y = X - U$, where $U$ is the fractional part of $X$. Since $U \in (0, 1)$, it satisfies $0 < E [U^{2}] < E [U] < 1$, providing bounds on the mean and the variance of the count variable: $E [X] - 1 \leq E [Y] \leq E [X]$ and $Var [X] \leq Var [Y] \leq Var [X] + 1 / 4$ [20].

The mean-preserved discrete version $Y$ of $X$ is the variable with the cdf:

\begin{matrix} F_{Y} (y | θ) & = & \int_{y}^{y + 1} F_{X} (x | θ) \, d x \quad \text{for } y \in N \end{matrix}

(4)

and expectation $E [Y] = E [X]$ [3].

2.3. Motivating Example and Definition

Example 1 (Measuring tree diameter). Discretization mechanisms arise when measuring any continuous quantity. Indeed, no sample can cover a whole continuum since the latter has an infinite number of points, and only a finite number of decimal places are reported in practice [14,21]. Assume for instance an operator measuring tree diameters $X$ in a forest inventory frame, using a measurement device scaled to millimeters (mm). Since $X$ is a continuous variable, the probability of observing $X = x$ mm is zero. When the true value $x$ of the diameter of a tree actually falls between two consecutive graduations $z$ and $z + 1$, the operator reports either $y = z$ mm or $y = z + 1$ mm, i.e., only a discretized version $Y$ of $X$ is observed. Beyond this example, when direct measures are taken, only the number of an arbitrary unit is actually counted. Clearly, the closer $x$ is to $z$, the higher the probability of reporting $y = z$, and conversely, the closer $x$ is to $z + 1$, the higher the probability of reporting $y = z + 1$. Balanced discretization results from assuming that given $z \leq x < z + 1$, the probability of reporting $y = z + 1$ is exactly $x - z$.

Definition 1 (Balanced discretization). Let us consider an absolutely continuous probability distribution $CD (θ)$ of interest. A count random variable $Y$ is said to be distributed as the balanced discrete counterpart, denoted $BD (θ)$, of the continuous distribution $CD (θ)$ if it has the stochastic representation:

\begin{matrix} Y | U = u, X = x & \overset{d}{=} & z + u \\ U | X = x & \sim & BER (r) \\ X & \sim & CD (θ) \end{matrix}

(5)

where $z = ⌊x⌋$ and $r = x - z$.
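The stochastic representation in Equation (5) can be sketched as a short sampler. The following is a minimal illustration in Python using only the standard library; the function names `balanced_round` and `sample_balanced_discrete` are hypothetical, and the gamma example at the end is an arbitrary choice of continuous distribution, not one prescribed at this point of the text.

```python
import math
import random


def balanced_round(x, rng=random):
    """Round a real x to floor(x) or floor(x) + 1 as in Equation (5).

    The upper value is chosen with probability r = x - floor(x), so the
    conditional expectation of the result given x is exactly x.
    """
    z = math.floor(x)                   # z = floor(x)
    r = x - z                           # fractional part, in [0, 1)
    u = 1 if rng.random() < r else 0    # U | X = x ~ BER(r)
    return z + u


def sample_balanced_discrete(draw_continuous, size, rng=random):
    """Draw `size` variates Y by rounding draws X from `draw_continuous`."""
    return [balanced_round(draw_continuous(), rng) for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(2021)
    # Illustrative choice: X ~ gamma with shape 5 and scale 1 (mean 5).
    ys = sample_balanced_discrete(lambda: rng.gammavariate(5.0, 1.0), 100_000, rng)
    print(sum(ys) / len(ys))  # sample mean of Y, expected to be near 5
```

Only one extra uniform draw per variate is needed on top of the continuous draw, which is the sense in which simulation requires basic operations only.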

Let $E_{X} (n, y | θ)$ denote the $n$th partial moment:

\begin{matrix} E_{X} (n, y | θ) = \int_{y}^{y + 1} x^{n} f_{X} (x | θ) d x \end{matrix}

(6)

of $X$ over $(y, y + 1)$, and set:

\begin{matrix} H_{X} (y | θ) & = & F_{X} (y + 1 | θ) - F_{X} (y | θ) . \end{matrix}

(7)

The balanced discretization mechanism in Equation (5) preserves partial expectations $E_{X} (1, y | θ)$ of the continuous variable, as shown by Equation (10) of the following lemma.

Lemma 1. Let $X$ and $Y$ be defined as in Equation (5). Then, for any $y \in Z$,

\begin{matrix} P (Y = y \text{ and } y \leq X < y + 1) & = & (y + 1) H_{X} (y | θ) - E_{X} (1, y | θ) \end{matrix}

(8)

\begin{matrix} P (Y = y + 1 \text{ and } y \leq X < y + 1) & = & E_{X} (1, y | θ) - y H_{X} (y | θ) \end{matrix}

(9)

\begin{matrix} E_{Y} [Y | y \leq X < y + 1] & = & E_{X} (1, y | θ) \end{matrix}

(10)

where $E_{Y} [Y | X \in A]$ is the partial mean of $Y$ for $X \in A$.
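For readers who want the intermediate step, the three identities of Lemma 1 can be obtained directly from Equation (5). The following is a brief sketch rather than the formal proof: given $X = x$ with $y \leq x < y + 1$, the variable $Y$ equals $y + 1$ with probability $x - y$ and $y$ with probability $1 - (x - y)$, and integrating these weights against the density gives Equations (8) and (9); Equation (10) then follows by summing the two values weighted by their probabilities.

```latex
\begin{aligned}
P(Y = y \text{ and } y \leq X < y + 1)
  &= \int_{y}^{y+1} \bigl[1 - (x - y)\bigr] f_{X}(x | θ)\, dx
   = (y + 1) H_{X}(y | θ) - E_{X}(1, y | θ), \\
P(Y = y + 1 \text{ and } y \leq X < y + 1)
  &= \int_{y}^{y+1} (x - y)\, f_{X}(x | θ)\, dx
   = E_{X}(1, y | θ) - y H_{X}(y | θ), \\
E_{Y}[Y | y \leq X < y + 1]
  &= y \bigl[(y + 1) H_{X}(y | θ) - E_{X}(1, y | θ)\bigr]
   + (y + 1) \bigl[E_{X}(1, y | θ) - y H_{X}(y | θ)\bigr]
   = E_{X}(1, y | θ).
\end{aligned}
```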

2.4. Probability Mass and Distribution Functions

We derive in this section some general distributional properties of balanced discrete distributions.

Proposition 1 (Distribution function). Let $Y \sim BD (θ)$. The pmf, the cdf, the suf, and the quf of $Y$ are given for $y \in Z$ and $0 \leq u \leq 1$ by:

\begin{matrix} f_{Y} (y | θ) & = & (y - 1) F_{X} (y - 1 | θ) - 2 y F_{X} (y | θ) + (y + 1) F_{X} (y + 1 | θ) \\ & & + E_{X} (1, y - 1 | θ) - E_{X} (1, y | θ) \end{matrix}

(11)

\begin{matrix} F_{Y} (y | θ) & = & F_{X} (y | θ) + (y + 1) H_{X} (y | θ) - E_{X} (1, y | θ) \end{matrix}

(12)

\begin{matrix} S_{Y} (y | θ) & = & S_{X} (y | θ) - (y - 1) H_{X} (y - 1 | θ) + E_{X} (1, y - 1 | θ) \end{matrix}

(13)

\begin{matrix} Q_{Y} (u | θ) & = & \begin{cases} x_{o} & \text{if } u_{o} \geq u \\ x_{o} + 1 & \text{otherwise} \end{cases} \end{matrix}

(14)

where $x_{o} = ⌊Q_{X} (u | θ)⌋$ and $u_{o} = F_{Y} (x_{o} | θ)$.
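Equations (11) and (12) only need two ingredients from the continuous distribution: its cdf and its first partial moment. The sketch below, in Python with the standard library only, codes them generically and plugs in an exponential distribution purely as an illustrative assumption, because both ingredients then have closed forms. The names `bd_cdf`, `bd_pmf`, and `make_exponential` are hypothetical.

```python
import math


def bd_cdf(y, cdf_x, partial_mean):
    """Equation (12): F_Y(y) = F_X(y) + (y + 1) H_X(y) - E_X(1, y)."""
    h = cdf_x(y + 1) - cdf_x(y)          # H_X(y) of Equation (7)
    return cdf_x(y) + (y + 1) * h - partial_mean(y)


def bd_pmf(y, cdf_x, partial_mean):
    """Equation (11), written with the same two ingredients."""
    return ((y - 1) * cdf_x(y - 1) - 2 * y * cdf_x(y) + (y + 1) * cdf_x(y + 1)
            + partial_mean(y - 1) - partial_mean(y))


def make_exponential(lam):
    """Return (cdf, first partial moment) of an exponential with rate lam.

    Illustrative assumption only: any continuous distribution with a
    computable cdf and partial mean could be used instead.
    """
    def cdf_x(x):
        return 1.0 - math.exp(-lam * x) if x > 0 else 0.0

    def antiderivative(x):
        # Antiderivative of x * lam * exp(-lam * x).
        return -(x + 1.0 / lam) * math.exp(-lam * x)

    def partial_mean(y):
        # E_X(1, y): integral of x f(x) over (y, y + 1), clipped to x >= 0.
        lo, hi = max(y, 0), max(y + 1, 0)
        return antiderivative(hi) - antiderivative(lo)

    return cdf_x, partial_mean


cdf_x, partial_mean = make_exponential(1.0)
pmf = [bd_pmf(y, cdf_x, partial_mean) for y in range(80)]
mean_y = sum(y * p for y, p in enumerate(pmf))   # compare with E[X] = 1
```

In this illustration the masses sum to one up to rounding error and `mean_y` agrees with the continuous mean, as the mean-preservation property requires.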

Note from Equation (11) that $BD (θ)$ assigns less probability mass to zero than the discrete concentration of $X \sim CD (θ)$ if $X$ has support $R_{+}$ or $(0, M)$ for $M \in R_{+}$. Equation (13) emphasizes that the balanced discretization method does not preserve the suf of the continuous distribution, unlike the discrete concentration (see Equation (2)). Nevertheless, the balanced discrete cdf and suf satisfy the inequalities $F_{X} (y | θ) \leq F_{Y} (y | θ) \leq F_{X} (y + 1 | θ)$ (with equalities when the support of $X$ is upper bounded by $y$) and $S_{X} (y | θ) \leq S_{Y} (y | θ) \leq S_{X} (y - 1 | θ)$ (with equalities when the support of $X$ is lower bounded by $y$).

By Equation (14), balanced discretization somewhat preserves the median of the continuous distribution. Indeed, if $X$ has an integral median $m_{X}$, then $Y$ has median $m_{Y} = m_{X} - 1 / 2$. More generally, we have $m_{Y} = ⌊m_{X}⌋ + 1 / 2$ if $F_{Y} (⌊m_{X}⌋ | θ) < 1 / 2$, $m_{Y} = ⌊m_{X}⌋$ if $F_{Y} (⌊m_{X}⌋ | θ) = 1 / 2$, and $m_{Y} = ⌊m_{X}⌋ - 1 / 2$ if $F_{Y} (⌊m_{X}⌋ | θ) > 1 / 2$.

2.5. Moments and Index of Dispersion

This section presents expressions for the moments of balanced discrete distributions. We start with the first two moments since they are the most important in a count regression context.

Proposition 2 (Mean and variance). Let $X \sim CD (θ)$ with mean $μ_{X} (θ)$ and variance $σ_{X}^{2} (θ)$. The balanced discrete counterpart of $X$, $Y \sim BD (θ)$, has mean $μ_{Y} (θ) = μ_{X} (θ)$ and variance:

\begin{matrix} σ_{Y}^{2} (θ) & = & σ_{X}^{2} (θ) + ζ_{0} (θ) \end{matrix}

(15)

where $ζ_{0} (θ) = E_{X} [R (1 - R)]$ with $R \overset{d}{=} X - ⌊X⌋$. In addition, $ζ_{0} (θ)$ satisfies the inequality $0 < ζ_{0} (θ) < \min \{μ_{Y} (θ), 1 / 4\}$ and is given by the sum $ζ_{0} (θ) = \sum_{z = - \infty}^{\infty} ζ_{0} (z, θ)$ with:

\begin{matrix} ζ_{0} (z, θ) & = & (2 z + 1) E_{X} (1, z | θ) - E_{X} (2, z | θ) - z (z + 1) H_{X} (z | θ) . \end{matrix}

(16)

From Proposition 2, it appears that when the variance of a balanced discrete variable $Y$ exists, it satisfies $σ_{X}^{2} (θ) < σ_{Y}^{2} (θ) < σ_{X}^{2} (θ) + \min \{μ_{Y} (θ), 1 / 4\}$. This suggests the inexpensive approximation ${\hat{σ}}_{Y}^{2} (θ) = σ_{X}^{2} (θ) + \min \{\hat{μ} / 2, 1 / 8\}$ with $| {\hat{σ}}_{Y}^{2} (θ) - σ_{Y}^{2} (θ) | < \min \{\hat{μ} / 2, 1 / 8\}$, $\hat{μ}$ being the mean of $Y$ (exact or estimated). The following corollary infers the ID (index of dispersion) of a balanced discrete distribution from Proposition 2.

Corollary 1 (Index of dispersion). Let $Y \sim BD (θ)$ be the balanced discrete counterpart of $X \sim CD (θ)$ with cdf $F_{X} (\cdot | θ)$, quf $Q_{X} (\cdot | θ)$, expectation $μ_{X} (θ) \neq 0$, and index of dispersion (variance-to-mean ratio) ${ID}_{X} (θ)$. The index of dispersion ${ID}_{Y} (θ)$ of $Y$ satisfies:

\begin{matrix} {ID}_{Y} (θ) & = & {ID}_{X} (θ) + \frac{ζ_{0} (θ)}{μ_{X} (θ)} \end{matrix}

(17)

\begin{matrix} | {ID}_{X} (θ) | & < & | {ID}_{Y} (θ) | \leq | {ID}_{X} (θ) | + \frac{1}{4 | μ_{X} (θ) |} . \end{matrix}

(18)

Furthermore, $ζ_{0} (θ)$ can be approximated with a tolerance $α \in (0, 1)$ by the truncated sum:

\begin{matrix} {\hat{ζ}}_{α} (θ) & = & \sum_{z = z_{i}}^{z_{f}} ζ_{0} (z, θ) \end{matrix}

(19)

where $z_{i} = ⌊Q_{X} (α / 2 | θ)⌋$, $z_{f} = ⌊Q_{X} (1 - α / 2 | θ)⌋ + 1$, and $α$ controls the precision of ${\hat{ζ}}_{α} (θ)$ via $| {\hat{ζ}}_{α} (θ) - ζ_{0} (θ) | < 1 - F_{X} (z_{f} + 1 | θ) + F_{X} (z_{i} | θ)$.
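The truncated sum in Equation (19) is straightforward to code once the partial moments in Equation (16) are available. The sketch below, in Python with the standard library only, obtains them by numerical integration (composite Simpson rule), which is an assumption of this illustration and not a method taken from the paper. The helper names and the unit-rate exponential example are hypothetical choices.

```python
import math


def simpson(g, lo, hi, m=200):
    """Composite Simpson rule for the integral of g over [lo, hi] (m even)."""
    h = (hi - lo) / m
    total = g(lo) + g(hi)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * g(lo + k * h)
    return total * h / 3.0


def zeta_term(pdf, z):
    """zeta_0(z, theta) of Equation (16), with numerically integrated moments."""
    h_z = simpson(pdf, z, z + 1)                       # H_X(z)
    e1 = simpson(lambda x: x * pdf(x), z, z + 1)       # E_X(1, z)
    e2 = simpson(lambda x: x * x * pdf(x), z, z + 1)   # E_X(2, z)
    return (2 * z + 1) * e1 - e2 - z * (z + 1) * h_z


def zeta_truncated(pdf, quantile, alpha=1e-8):
    """Truncated sum of Equation (19) over z_i, ..., z_f."""
    z_i = math.floor(quantile(alpha / 2))
    z_f = math.floor(quantile(1 - alpha / 2)) + 1
    return sum(zeta_term(pdf, z) for z in range(z_i, z_f + 1))


# Illustration with a unit-rate exponential variable (an arbitrary choice).
def exp_pdf(x):
    return math.exp(-x) if x >= 0 else 0.0


def exp_quantile(u):
    return -math.log(1.0 - u)


zeta = zeta_truncated(exp_pdf, exp_quantile)
variance_y = 1.0 + zeta   # Equation (15), using sigma_X^2 = 1 for this example
```

For this particular example the fractional part of $X$ has a closed-form distribution, giving $ζ_{0} = (3 e^{- 1} - 1) / (1 - e^{- 1})$, which offers a convenient check of the truncated sum.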

The next proposition shows the relation between moments of balanced discrete distributions and of discrete concentrations.

Proposition 3. Let $Y \sim BD (θ)$ be the balanced discrete counterpart of $X \sim CD (θ)$. The $n$th moment of $Y$ satisfies for $n \in N_{+}$:

\begin{matrix} μ_{Y}^{(n)} (θ) & = & μ_{Z}^{(n)} (θ) + \sum_{i = 0}^{n - 1} \binom{n}{i} μ_{Z U}^{(i)} (θ) \end{matrix}

(20)

where $μ_{Z}^{(n)} (θ)$ is the $n$th moment of the discrete concentration $Z = ⌊X⌋$ (Equation (3)) and $μ_{Z U}^{(i)} (θ)$ is the expectation of the product of $Z^{i}$ and $U$ with $U | X \sim BER (X - Z)$, given by $μ_{Z U}^{(i)} (θ) = - μ_{Z}^{(i + 1)} (θ) + \sum_{z = - \infty}^{\infty} z^{i} E_{X} (1, z | θ)$.

2.6. Conditional Distributions of Latent Continuous and Binary Outcomes

Although the balanced discrete distribution was motivated by the need for mean-parameterizable flexible discrete probability distributions, it may be used to model any continuous outcome measured to fewer decimal places. In such instances, the conditional distribution and in particular the conditional mean of the underlying continuous distribution may be useful for predicting the continuous variable given an observed discrete value. In addition, since a balanced discrete variable is the observable feature of an underlying continuous outcome, a useful tool for maximum likelihood inference in complex models is the expectation-maximization algorithm [22], which handles any latent class-like model. In a Bayesian inference framework, the stochastic representation of the balanced discrete distribution can also be useful for sampling the posterior distribution of model parameters when draws from the truncated form of the continuous distribution are inexpensive. The following result provides expressions for these purposes.

Proposition 4. Let $X$, $U$, and $Y$ be defined as in Equation (5). Then, for $y \in Z$ with probability mass $f_{Y} (y | θ) > 0$:

\begin{matrix} f_{U | Y} (u | Y = y, θ) & = & p_{y}^{u} {[1 - p_{y}]}^{1 - u} \quad \text{for } u \in \{0, 1\}, \end{matrix}

(21)

\begin{matrix} f_{X | Y} (x | Y = y, θ) & = & \frac{f_{X} (x | θ)}{f_{Y} (y | θ)} [(1 - y + x) I_{(y - 1, y)} (x) + (1 + y - x) I_{(y, y + 1)} (x)], \end{matrix}

(22)

and for $n \in R$ such that $X^{n}$ is well defined in both $(y - 1, y)$ and $(y, y + 1)$,

\begin{matrix} E_{X | Y} [X^{n} | Y = y, θ] & = & \frac{1}{f_{Y} (y | θ)} [(1 - y) E_{X} (n, y - 1 | θ) + E_{X} (n + 1, y - 1 | θ) \\ & & + (1 + y) E_{X} (n, y | θ) - E_{X} (n + 1, y | θ)] \end{matrix}

(23)

where $p_{y} = {[f_{Y} (y | θ)]}^{- 1} [E_{X} (1, y - 1 | θ) - (y - 1) H_{X} (y - 1 | θ)]$ is the conditional mean (success probability) of the Bernoulli variable $U$ given $Y = y$ (the numerator is Equation (9) evaluated at $y - 1$) and $I_{A} (x)$ is the indicator function, which equals one if $x \in A$ and zero otherwise.

Note from Equation (22) that given the continuous variable, the distribution of the discrete variable does not depend on the parameter vector $θ$:

\begin{matrix} f_{Y | X} (y | X = x, θ) = (1 - y + x) I_{(y - 1, y)} (x) + (1 + y - x) I_{(y, y + 1)} (x) . \end{matrix}

(24)

Therefore, in the expectation-maximization algorithm framework, the maximization of the joint likelihood:

\begin{matrix} f_{X, Y} (x, y) = f_{Y | X} (y | X = x, θ) f_{X} (x | θ) \end{matrix}

(25)

of $Y$ and $X$ is reduced to the maximization of the likelihood $f_{X} (x | θ)$ of the continuous variable $X$. Hence, the expectation-maximization algorithm will be appropriate for fitting a balanced discrete distribution whenever fitting the underlying continuous distribution is easy.
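Equations (22) and (24) also suggest a simple way to draw the latent continuous value given an observed count, for instance inside a simulation-based expectation step. The sketch below, in Python with the standard library only, uses a naive rejection scheme: a draw from the continuous distribution is kept with probability $f_{Y | X} (y | X = x)$. This scheme, the function names, and the exponential example are assumptions made for illustration; drawing from the truncated continuous distribution first would generally waste fewer proposals.

```python
import math
import random


def prob_y_given_x(y, x):
    """P(Y = y | X = x) from Equation (24): a triangular weight centred at y."""
    return max(0.0, 1.0 - abs(x - y))


def sample_x_given_y(y, draw_continuous, size, rng=random):
    """Naive rejection sampler for X | Y = y (requires f_Y(y) > 0).

    A draw x from the continuous distribution is kept with probability
    P(Y = y | X = x), so accepted draws follow the density in Equation (22).
    """
    accepted = []
    while len(accepted) < size:
        x = draw_continuous()
        if rng.random() < prob_y_given_x(y, x):
            accepted.append(x)
    return accepted


if __name__ == "__main__":
    rng = random.Random(42)
    # Illustrative choice: unit-rate exponential X, observed count y = 0.
    xs = sample_x_given_y(0, lambda: rng.expovariate(1.0), 10_000, rng)
    print(sum(xs) / len(xs))  # Monte Carlo estimate of E[X | Y = 0]
```

For the unit-rate exponential example, Equation (23) with $n = 1$ and $y = 0$ gives $E [X | Y = 0] = 3 - e$, which the Monte Carlo average should approach.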

2.7. Link with Mean-Preserving Discretization

Recall from Equation (4) that the cdf of the mean-preserved count variable has the form:

\begin{matrix} F_{Y} (y | θ) = \int_{y}^{y + 1} F_{X} (x | θ) d x . \end{matrix}

From the identity $\int F_{X} (x | θ) \, d x = x F_{X} (x | θ) - \int x f_{X} (x | θ) \, d x$, we have:

\begin{matrix} F_{Y} (y | θ) = (y + 1) F_{X} (y + 1 | θ) - y F_{X} (y | θ) - E_{X} (1, y | θ) . \end{matrix}

(26)

Then, using the identity $F_{X} (y + 1 | θ) = F_{X} (y | θ) + [F_{X} (y + 1 | θ) - F_{X} (y | θ)]$ straightforwardly results in:

\begin{matrix} F_{Y} (y | θ) = F_{X} (y | θ) + (y + 1) [F_{X} (y + 1 | θ) - F_{X} (y | θ)] - E_{X} (1, y | θ), \end{matrix}

i.e., the cdf in Equation (12). Therefore, balanced discretization as defined in Equation (5) provides a generating mechanism for the mean-preserving method of [3].

3. The Balanced Discrete Gamma Family

The class of gamma distributions is a flexible class of distributions encountered in various statistical applications. This class of distributions includes as special cases the exponential, the one-parameter gamma, and up to rescaling the chi-squared distributions [23]. Chakraborty and Chakravarty [24] studied the discrete concentration of the gamma distribution with applications in biology and socio-economics. In order to allow exact inference in flexible count regression models, we apply in this section the balanced discretization method to gamma distributions. We present the balanced discrete gamma distribution under mean parametrization convenient for regression purposes and compare the distribution to the discrete concentration of the gamma distribution and to the Poisson distribution.

Let $G (b, a)$ denote the gamma distribution with cdf $F_{g} (x | b, a) = γ (a x, b)$ for $x > 0$, where $γ (x, a) = \int_{0}^{x} u^{a - 1} e^{- u} \, d u / Γ (a)$ is the lower incomplete gamma ratio for $(a, x) \in R_{+}^{2}$ and $Γ (a) = \int_{0}^{\infty} u^{a - 1} e^{- u} \, d u$ is the gamma function. A random variable $X \sim G (b, a)$ has expectation $b / a$, variance $b / a^{2}$, and $n$th order partial moment $E_{g} (n, y | b, a) = E_{X} [X^{n} | y \leq X < y + 1]$ given by:

\begin{matrix} E_{g} (n, y | b, a) & = & \frac{Γ (b + n)}{a^{n} Γ (b)} [γ (a (y + 1), b + n) - γ (a y, b + n)] . \end{matrix}

(27)

The one-parameter gamma distribution is obtained for $a = 1$ (equidispersion) and is denoted $G (b)$.

3.1. The Balanced Discrete Gamma Distribution

A count random variable with support $N$ is said to follow a balanced discrete gamma distribution, denoted $BG (μ, a)$ for $(μ, a) \in R_{+}^{2}$, if it is generated by the discretization mechanism in Equation (5) with $X \sim G (a μ, a)$. By Proposition 2, a $BG (μ, a)$ variable has expectation $μ$. Using Equation (27), some properties of $BG (μ, a)$ follow as in Corollary 2 hereafter.

Corollary 2 (Balanced discrete gamma distribution). Let $Y \sim BG (μ, a)$, and set $b = a μ$. Then, the pmf and the cdf of $Y$ are respectively given for $y \in N$ by:

\begin{matrix} f_{d g} (y | μ, a) & = & (y - 1) γ (a (y - 1), b) - 2 y γ (a y, b) + (y + 1) γ (a (y + 1), b) \\ & & - μ [γ (a (y - 1), b + 1) - 2 γ (a y, b + 1) + γ (a (y + 1), b + 1)] \end{matrix}

(28)

\begin{matrix} F_{d g} (y | μ, a) & = & (y + 1) γ (a (y + 1), b) - y γ (a y, b) \\ & & - μ [γ (a (y + 1), b + 1) - γ (a y, b + 1)]; \end{matrix}

(29)

and the variance and the index of dispersion of $Y$ are respectively given by:

\begin{matrix} σ_{d g}^{2} (μ, a) & = & a^{- 1} μ + ζ_{g_{0}} (μ, a) \end{matrix}

(30)

\begin{matrix} {ID}_{d g} (μ, a) & = & a^{- 1} + μ^{- 1} ζ_{g_{0}} (μ, a) \end{matrix}

(31)

where:

\begin{matrix} ζ_{g_{0}} (μ, a) & = & \sum_{z = 0}^{\infty} ζ_{g_{0}} (z, μ, a) \quad \text{with} \end{matrix}

(32)

\begin{matrix} ζ_{g_{0}} (z, μ, a) & = & - μ (μ + a^{- 1}) [γ (a (z + 1), b + 2) - γ (a z, b + 2)] \\ & & + μ (2 z + 1) [γ (a (z + 1), b + 1) - γ (a z, b + 1)] \\ & & - z (z + 1) [γ (a (z + 1), b) - γ (a z, b)] . \end{matrix}

(33)

The pmf (28) and the cdf (29) follow from Equations (11) and (12), respectively, along with Equation (27). Note that the computations of the pmf and the cdf of a balanced discrete gamma distribution only require a routine for the incomplete gamma ratio $γ (\cdot, \cdot)$, which is available in most statistical software as the cdf of the continuous gamma distribution (e.g., pgamma in the R freeware [25] and gamcdf in MATLAB [26]). The variance (30) and the index of dispersion (31) follow from Equation (15) along with Equation (27). Note that the variance term $ζ_{g_{0}} (μ, a)$ in Equation (32) can be approximated via the truncation mechanism in Equation (19) with a tolerance $α \in (0, 1)$ as:

\begin{matrix} {\hat{ζ}}_{g_{α}} (μ, a) & = & \sum_{z = z_{i}}^{z_{f}} ζ_{g_{0}} (z, μ, a) \end{matrix}

(34)

where $z_{i} = ⌊γ^{- 1} (α / 2, b) / a⌋$ and $z_{f} = ⌊γ^{- 1} (1 - α / 2, b) / a⌋ + 1$, with $γ^{- 1} (\cdot, b)$ the inverse function of the incomplete gamma ratio $γ (\cdot, b)$ (available, e.g., as qgamma in R and gaminv in MATLAB).
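As a concrete illustration of Equations (28) and (29), the sketch below evaluates the balanced discrete gamma pmf and cdf in Python using only the standard library. Because the standard library has no incomplete gamma routine, the helper `gamma_ratio` implements the textbook power series for the lower incomplete gamma ratio; this helper, like the names `bdg_pmf` and `bdg_cdf`, is an assumption of this sketch and is only suitable for moderate arguments. In practice a library routine such as pgamma in R would be used instead.

```python
import math


def gamma_ratio(x, s, tol=1e-15, max_terms=10_000):
    """Lower incomplete gamma ratio gamma(x, s), i.e. the cdf at x of a
    gamma variable with shape s and unit rate, via its power series.

    Adequate for moderate x; not a replacement for a library routine.
    """
    if x <= 0.0:
        return 0.0
    term = 1.0 / s
    total = term
    for n in range(1, max_terms):
        term *= x / (s + n)
        total += term
        if term < tol * total:
            break
    return min(1.0, total * math.exp(s * math.log(x) - x - math.lgamma(s)))


def bdg_pmf(y, mu, a):
    """Equation (28) with b = a * mu."""
    b = a * mu
    g = gamma_ratio
    return ((y - 1) * g(a * (y - 1), b) - 2 * y * g(a * y, b)
            + (y + 1) * g(a * (y + 1), b)
            - mu * (g(a * (y - 1), b + 1) - 2 * g(a * y, b + 1)
                    + g(a * (y + 1), b + 1)))


def bdg_cdf(y, mu, a):
    """Equation (29) with b = a * mu."""
    b = a * mu
    g = gamma_ratio
    return ((y + 1) * g(a * (y + 1), b) - y * g(a * y, b)
            - mu * (g(a * (y + 1), b + 1) - g(a * y, b + 1)))


# Illustration: BG(2.5, 1), i.e. latent equidispersion with mean 2.5.
pmf = [bdg_pmf(y, 2.5, 1.0) for y in range(50)]
mean_y = sum(y * p for y, p in enumerate(pmf))
var_y = sum(y * y * p for y, p in enumerate(pmf)) - mean_y ** 2
```

In this illustration `mean_y` recovers $μ = 2.5$, and `var_y` lies between $μ / a$ and $μ / a + 1 / 4$, in line with Equation (30) and the bound on the additional variance term.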

The one-parameter BDG (balanced discrete gamma) distribution, denoted $BG (μ)$ and obtained by setting $a = 1$, corresponds to a latent equidispersion mechanism and is marginally slightly overdispersed, as indicated by Equation (31) with $a = 1$. Setting $a = μ^{- 1}$ produces the balanced discrete exponential distribution $BE (μ)$, which is close to the geometric distribution since the latter corresponds to the discrete concentration of the exponential distribution [27].

Figure 1 displays the probability mass function of the BDG distributions with mean values $μ = 2.5$ and $μ = 5$ (computed using Equation (28) in R). It appears that the scale parameter $a$ controls the shape of the distribution, allowing both unimodal and reverse J shapes. It can be observed that the spread of a BDG distribution $BG (μ, a)$ decreases with $a$ for fixed $μ$. The boxplots depicted in Figure 2 show that the skewness of the distribution (as measured by the difference between the mean and the median values) increases with its spread and thus decreases with $a$. As expected, both the length of the right tail of the distribution (Figure 1) and the probability of unusual (far from the mean and the median) observations (Figure 2) increase with the spread.

The index of dispersion, which can assume any positive value, is computed using Equation (31) in R and depicted in Figure 3 for $a \geq 1$. It can be observed in accordance with Equation (31) that for large mean values ($μ \geq 10$), the variance-to-mean ratio is not very sensitive to $μ$ for fixed, but not too large scale parameter values ($a < 50$). This also holds for very low scales ($a < 0.5$) for any mean value. Very large scales ($a \geq 50$) induce, in addition to severe underdispersion ($ID < 0.2$), an oscillating index of dispersion with an amplitude approaching zero as $μ$ increases. Indeed, for large scale values (as $a \to \infty$) and finite $μ$, the continuous gamma distribution converges to a point mass at $μ$. The balanced discrete gamma distribution thus converges to a binary variable, which takes the value $y = ⌊μ⌋$ with probability $1 - (μ - ⌊μ⌋)$ and the value $y = ⌊μ⌋ + 1$ with probability $μ - ⌊μ⌋$, and has variance $σ_{d g}^{2} (μ, \infty) = (μ - ⌊μ⌋) (1 - μ + ⌊μ⌋)$. The oscillating pattern in Figure 3 accordingly shows that the variance is minimal when $μ$ is an integer and maximal when $μ$ is a half integer. When both $μ$ and $a$ are large, the amplitude of the oscillations of $ID$ decreases, and $ID$ approaches zero, in accordance with Equation (31).

3.2. Comparison with Some Alternatives

The balanced discretization approach results from a light modification of the discrete concentration method. This section assesses on the one hand to what extent the two discretization approaches differ, considering the balanced discrete gamma (BDG) distribution case. On the other hand, the difference between the Poisson and the BDG distributions is evaluated under both latent and marginal equidispersion restrictions.

Among the miscellaneous measures proposed to assess the similarities between probability distributions, the Jensen–Shannon divergence (JSD) [28] has many desirable properties that support its use in statistics [29]. The JSD is an information theory measure given for two pmfs $q_{1} (\cdot)$ and $q_{2} (\cdot)$ by [28]:

\begin{matrix} \mathrm{JSD} (q_{1}, q_{2}) & = & K (q_{1}, q_{2}) + K (q_{2}, q_{1}) \end{matrix}

(35)

where

\begin{matrix} K (q_{1}, q_{2}) & = & \sum_{y = - \infty}^{\infty} q_{1} (y) \log_{2} \left( \frac{q_{1} (y)}{0.5 q_{1} (y) + 0.5 q_{2} (y)} \right) \end{matrix}

(36)

with the convention $q_{1} (y) \log_{2} (q_{1} (y) / (0.5 q_{1} (y) + 0.5 q_{2} (y))) = 0$ if $q_{1} (y) = 0$. The JSD measures the discrepancy between $q_{1} (\cdot)$ and $q_{2} (\cdot)$ in bits. It is bounded as $0 \leq \mathrm{JSD} (q_{1}, q_{2}) \leq 2$ and is zero only if $q_{1} (y) = q_{2} (y)$ for all $y \in Z$. The JSD values presented in this section are computed from Equation (35) using the R freeware.
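Equations (35) and (36) translate directly into code. The sketch below is a minimal Python version using only the standard library; representing each pmf as a dictionary from support points to probabilities, and the function names `k_divergence` and `jsd`, are choices made for this illustration rather than the implementation used for the reported results.

```python
import math


def k_divergence(q1, q2):
    """K(q1, q2) of Equation (36), in bits.

    Each pmf is a dict mapping support points to probabilities; points
    absent from a dict are treated as having zero mass.
    """
    total = 0.0
    for y, p in q1.items():
        if p > 0.0:  # convention: terms with q1(y) = 0 contribute zero
            mixture = 0.5 * p + 0.5 * q2.get(y, 0.0)
            total += p * math.log2(p / mixture)
    return total


def jsd(q1, q2):
    """Jensen-Shannon divergence of Equation (35), between 0 and 2 bits."""
    return k_divergence(q1, q2) + k_divergence(q2, q1)


# Two small illustrative pmfs on {0, 1, 2}.
q_a = {0: 0.5, 1: 0.3, 2: 0.2}
q_b = {0: 0.4, 1: 0.4, 2: 0.2}
divergence = jsd(q_a, q_b)
```

With this definition, identical pmfs give a divergence of zero and pmfs with disjoint supports reach the upper bound of 2 bits.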

3.2.1. Balanced Discretization Versus Discrete Concentration

Figure 4 illustrates the balanced discretization method using the continuous gamma distribution with parameters $a = 1$, $b = 5$. Unlike the discrete concentration, whose cdf lies above the continuous cdf, the balanced discrete distribution is constructed so that the continuous cdf interpolates the cdf of the balanced discrete distribution.

The curves in Figure 5A show the JSD measure between the balanced discrete gamma and the corresponding discrete concentrations for mean count values $μ \leq 30$. The selected scale values allow a wide range for the index of dispersion (ID), which roughly runs from 0.04 to 10. It can be observed that the JSD measure is relatively low ($JSD < 0.80$ bit) and decreases overall with the mean count (but not monotonically). In other words, the discrete concentration and balanced discretization methods produce similar discrete analogues of the considered continuous gamma distributions for large mean values. The JSD measure is especially low ($JSD \leq 0.10$ bit) for equidispersed and overdispersed balanced discrete gamma distributions ($a \leq 1$). High discrepancy ($JSD \geq 0.5$ bit) between the discrete analogues from the two discretization methods generally appears in underdispersion situations with a very low mean count ($μ < 1$) or a large scale parameter ($a > 5$, implying $ID < 0.45$). For very large scale values ($a \geq 30$), $JSD$ becomes erratic, oscillating between minima right before the integer values of $μ$ and maxima right after the integer values of $μ$.

3.2.2. Distance to the Poisson Distribution under Equidispersion

The Poisson regression is the most common count regression model, which is appropriate for equidispersed count data. Although the BDG distribution does not include the Poisson distribution as a special case, the distribution can be restricted to allow equidispersion. This can be achieved by solving the nonlinear equation $σ_{d g}^{2} (μ, a) = μ$ for $a$ using Equation (30) (marginal equidispersion). However, the one-parameter BDG distribution $BG (μ)$ offers a conceptually insightful alternative (latent equidispersion), which is analytically tractable ($a = 1$).

In order to determine which equidispersion balanced discrete gamma model (marginal vs. latent equidispersion) is the most appropriate when seeking a parsimonious flexible count regression model, the JSD measure was computed for fixed mean count $μ$ between the Poisson distribution and the BDG distribution under the marginal, as well as the latent, equidispersion restrictions. The results displayed in Figure 5B against the mean count indicate that when restricted to be equidispersed, the BDG distribution becomes similar to the Poisson distribution, as shown by the low JSD values ($JSD < 0.015$ bit). It can be observed that the marginally equidispersed BDG distribution is the closest to the Poisson distribution only for a very low mean count ($μ \leq 1.1176$). For a larger mean count ($μ > 1.1176$), the one-parameter BDG distribution is closer to the Poisson distribution than the marginally equidispersed BDG distribution, although the difference becomes unnoticeable for $μ > 10$.
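The comparison with the Poisson distribution can be sketched numerically for the latent equidispersion case ($a = 1$). The code below is an illustration under stated assumptions, not the authors' implementation: the latent variable is taken to be gamma with shape $μ$ and unit rate, the balanced discrete probability mass is obtained by a midpoint rule applied to the two weighted integrals appearing in the proof of Proposition 1, and the JSD uses equal weights and base-2 logarithms. The helper names and the truncation of the support at 60 are hypothetical choices.

```python
import math

def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution with the given shape and rate."""
    if x <= 0.0:
        return 0.0
    log_f = (shape * math.log(rate) + (shape - 1.0) * math.log(x)
             - rate * x - math.lgamma(shape))
    return math.exp(log_f)

def balanced_pmf(y, pdf, n=400):
    """P(Y = y) for the balanced discretization of a density pdf."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        # x = y - 1 + t lies in (y - 1, y): rounded up to y with probability t
        total += t * pdf(y - 1.0 + t) * h
        # x = y + t lies in (y, y + 1): rounded down to y with probability 1 - t
        total += (1.0 - t) * pdf(y + t) * h
    return total

def poisson_pmf(y, mu):
    """Poisson probability mass at y, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1.0))

def jsd_bits(p, q):
    """Equal-weight Jensen-Shannon divergence in bits for aligned pmf lists."""
    total = 0.0
    for pk, qk in zip(p, q):
        mk = 0.5 * (pk + qk)
        if pk > 0.0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0.0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total

mu = 5.0
support = range(0, 61)  # truncated support; the neglected tail mass is tiny here
bdg = [balanced_pmf(y, lambda x: gamma_pdf(x, mu, 1.0)) for y in support]
poi = [poisson_pmf(y, mu) for y in support]
print(sum(bdg), sum(y * p for y, p in zip(support, bdg)), jsd_bits(bdg, poi))
```

The first two printed values check that the computed masses sum to about one and that the mean of the discrete distribution matches $μ$, which is the mean-preservation property of the method.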

It appears that the one-parameter BDG distribution based count regression model will be an effective parsimonious (few parameters and more tractable) model [30] that can be fit to observed data to check the appropriateness of an equidispersion model. Therefore, while a BDG regression model will allow exact inference in flexible count modeling, testing for latent equidispersion will allow recovering a near Poisson regression model when supported by observed data.

4. Conclusions

With a view toward allowing exact inference in flexible count regression models, this work describes balanced discretization, a method for simulating and modeling integer-valued data starting from a continuous random variable, through the use of a probabilistic rounding mechanism. Most of the existing alternatives were built to conserve a specific characteristic of the continuous variable, e.g., the failure rate [31] and the survival [14] functions for modeling reliability data. Our proposal preserves expectation and is thus appropriate for count regression. The method is very close to the discretizing approach of [17], which also preserves the mean value but requires an a priori double truncation of the continuous variable and introduces a tuning parameter. Physical interpretation is an important selection criterion for choosing an appropriate discretization method [32]. As such, our proposal is motivated by a real-world generating mechanism and provides a physical interpretation for the mean-preserving method of [3]. Although balanced discrete distributions can model any count data, they may not be appropriate for aging data, for which the integer part or the ceiling is generally used [14], so that discrete concentrations are a better choice.

The flexibility of the balanced discrete gamma family developed from the continuous gamma distribution illustrates the potential of the balanced discretization method for capturing any level of dispersion in observed count data. In addition to flexibility, the balanced discrete gamma family turns out to be similar to the Poisson distribution when restricted to be equidispersed (marginal equidispersion) and when constructed using an equidispersed continuous gamma distribution (latent equidispersion). Based on this, we conjecture that while covering all types of dispersion, a flexible count regression model based on the balanced discrete gamma distribution will allow recovering a near Poisson distribution model when the data are Poisson distributed. Future research will target the use of the balanced discrete distribution in count regression analysis. The extension of balanced discretization to a multivariate setting is also considered to handle count data grouped by some sampling units and mixtures of count and continuous responses.

Author Contributions

Conceptualization, C.F.T. and R.G.K.; methodology, C.F.T.; software, C.F.T.; validation, C.F.T., S.H.H., J.T.D., and R.G.K.; formal analysis, C.F.T.; resources, R.G.K.; writing—original draft preparation, C.F.T.; writing—review and editing, C.F.T., S.H.H., J.T.D., and R.G.K.; visualization, C.F.T.; supervision, R.G.K.; project administration, R.G.K.; funding acquisition, R.G.K. All authors read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors confirm that the data supporting the findings of this work are available within the article.

Acknowledgments

The first author gratefully acknowledges financial support from the Centre d’Excellence d’Afrique en Sciences Mathematiques, Informatique et Applications (CEA-SMIA). The authors are grateful to four anonymous reviewers for their comments and constructive suggestions, which significantly improved the presentation of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

$BER$	Bernoulli (distribution)
$BE$	Balanced discrete exponential (distribution)
$BD$	Balanced discrete (distribution)
$BG$	Balanced discrete gamma (distribution)
BDG	Balanced discrete gamma
$CD$	Continuous (distribution)
$DC$	Discrete concentration (distribution)
$G$	Gamma (distribution)
ID	Index of dispersion
JSD	Jensen–Shannon divergence
pdf	probability density function
pmf	probability mass function
quf	quantile function
suf	survival function

Appendix A. Proofs of Lemmas and Propositions

Appendix A.1. Proof of Lemma 1

Proof of Lemma 1.

By the definition in Equation (5), the probability of observing $Y = y$ given that $X = x$ with $y \leq x < y + 1$ is $1 - r = 1 - x + y$ with $r = x - y$. Thus, the probability of observing $Y = y$ and $y \leq X < y + 1$ is the integral of $[(y + 1) - x] f_{X}(x | θ)$ with respect to $x$ over $(y, y + 1)$, i.e.,

$$\begin{aligned} P(Y = y \text{ and } y \leq X < y + 1) &= \int_{y}^{y + 1} [(y + 1) - x] f_{X}(x | θ) \, dx \\ &= (y + 1) \int_{y}^{y + 1} f_{X}(x | θ) \, dx - \int_{y}^{y + 1} x f_{X}(x | θ) \, dx, \end{aligned}$$

which proves Equation (8). Using the same argument on the probability of observing $Y = y + 1$ given that $y \leq X < y + 1$ shows that $P(Y = y + 1 \text{ and } y \leq X < y + 1)$ equals the integral of $(x - y) f_{X}(x | θ)$ with respect to $x$ over $(y, y + 1)$, which yields Equation (9). Next, since $Y$ is discrete and takes one of the two values $y$ and $y + 1$ when $y \leq X < y + 1$, the partial expectation of $Y$ is:

$$\begin{aligned} E_{Y}[Y | y \leq X < y + 1] &= y P(Y = y \text{ and } y \leq X < y + 1) + (y + 1) P(Y = y + 1 \text{ and } y \leq X < y + 1) \\ &= P(Y = y + 1 \text{ and } y \leq X < y + 1) \\ &\quad + y [P(Y = y \text{ and } y \leq X < y + 1) + P(Y = y + 1 \text{ and } y \leq X < y + 1)]. \end{aligned}$$

Replacing $P(Y = y \text{ and } y \leq X < y + 1) + P(Y = y + 1 \text{ and } y \leq X < y + 1)$ by the equivalent probability $P(y \leq X < y + 1) = F_{X}(y + 1 | θ) - F_{X}(y | θ)$ and using Equation (9) to obtain $P(Y = y + 1 \text{ and } y \leq X < y + 1)$ results in Equation (10). □

Appendix A.2. Proof of Proposition 1

Proof of Proposition 1.

It follows from the defining mechanism in Equation (5) that the only ways to obtain $Y = y$ are ($U = 1$ and $y - 1 \leq X < y$) and ($U = 0$ and $y \leq X < y + 1$). In other words, $Y = y$ occurs together with either $y - 1 \leq X < y$ or $y \leq X < y + 1$. Since the two instances are mutually exclusive, this gives:

$$\begin{aligned} f_{Y}(y | θ) &= P(Y = y \text{ and } y - 1 \leq X < y) + P(Y = y \text{ and } y \leq X < y + 1) \\ &= E_{X}(1, y - 1 | θ) - (y - 1) [F_{X}(y | θ) - F_{X}(y - 1 | θ)] \\ &\quad + (y + 1) [F_{X}(y + 1 | θ) - F_{X}(y | θ)] - E_{X}(1, y | θ), \end{aligned}$$

where the second equality follows from replacing $y$ by $y - 1$ in Equation (9) to compute the probability $P(Y = y \text{ and } y - 1 \leq X < y)$ and using Equation (8) to obtain $P(Y = y \text{ and } y \leq X < y + 1)$. Rearranging the right-hand side of the last equation as

$$f_{Y}(y | θ) = (y - 1) F_{X}(y - 1 | θ) + [-(y - 1) - (y + 1)] F_{X}(y | θ) + (y + 1) F_{X}(y + 1 | θ) + E_{X}(1, y - 1 | θ) - E_{X}(1, y | θ)$$

yields Equation (11). Again, using the defining mechanism in Equation (5), it follows that $Y \leq y$ is equivalent to $X < y$ or $\{Y = y \text{ and } y \leq X < y + 1\}$. Since the two instances are mutually exclusive, using Equation (8) gives:

$$\begin{aligned} F_{Y}(y | θ) &= P(X < y) + P(Y = y \text{ and } y \leq X < y + 1) \\ &= F_{X}(y | θ) + (y + 1) [F_{X}(y + 1 | θ) - F_{X}(y | θ)] - E_{X}(1, y | θ), \end{aligned}$$

which proves Equation (12) and implies that (a) $F_{X}(y | θ) < F_{Y}(y | θ) < F_{X}(y + 1 | θ)$. The suf is obtained as $S_{Y}(y | θ) = P(X \geq y) + P(Y = y \text{ and } y - 1 \leq X < y)$ from the definition $S_{Y}(y | θ) = P(Y \geq y)$, which straightforwardly results in Equation (13) on replacing $P(X \geq y) = S_{X}(y | θ)$ and using Equation (9), with $y - 1$ in place of $y$, to compute $P(Y = y \text{ and } y - 1 \leq X < y)$.

From the definition of the quantile function for $0 \leq u \leq 1$ as $Q_{Y}(u | θ) = \inf \{y \in \mathbb{Z} | F_{Y}(y | θ) \geq u\}$, $y = Q_{Y}(u | θ)$ implies the inequality $F_{Y}(y - 1 | θ) < u \leq F_{Y}(y | θ)$. Let $q_{o} = Q_{X}(u | θ)$, and set $x_{o} = ⌊q_{o}⌋$ and $u_{o} = F_{Y}(x_{o} | θ)$. By the inequality (a), we have on the one hand, (b) if $u = F_{Y}(y - 1 | θ)$, then $q_{o} \in (y - 1, y)$, and on the other hand, (c) if $u = F_{Y}(y | θ)$, then $q_{o} \in (y, y + 1)$. Since $F_{Y}(\cdot | θ)$ is increasing, (b) and (c) result in $q_{o} \in (y - 1, y + 1)$, and thus, $x_{o} \in \{y - 1, y\}$ or equivalently $y \in \{x_{o}, x_{o} + 1\}$. Hence, $Q_{Y}(u | θ) = x_{o}$ if $u_{o} \geq u$ and $Q_{Y}(u | θ) = x_{o} + 1$ otherwise. □
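The quantile rule at the end of this proof can be checked on a case where every ingredient has a closed form. The sketch below assumes, purely for illustration, that $X$ is exponential with a hypothetical rate of 0.3 (the article itself works with the gamma family), reads $E_{X}(1, y | θ)$ as the integral of $x f_{X}(x | θ)$ over $(y, y + 1)$, and compares the two-candidate shortcut with a direct search for the smallest integer $y$ such that $F_{Y}(y | θ) \geq u$.

```python
import math

RATE = 0.3  # hypothetical rate of the latent exponential variable X

def cdf_x(x):
    """F_X(x) for the exponential distribution."""
    return 1.0 - math.exp(-RATE * x) if x > 0.0 else 0.0

def partial_mean(y):
    """E_X(1, y): integral of x f_X(x) over (y, y + 1), in closed form."""
    if y < 0:
        return 0.0
    def antiderivative(x):
        return -(x + 1.0 / RATE) * math.exp(-RATE * x)
    return antiderivative(y + 1) - antiderivative(y)

def cdf_y(y):
    """F_Y(y) = F_X(y) + (y + 1)[F_X(y + 1) - F_X(y)] - E_X(1, y)."""
    return cdf_x(y) + (y + 1) * (cdf_x(y + 1) - cdf_x(y)) - partial_mean(y)

def quantile_y(u):
    """Shortcut from the proof: only x_o and x_o + 1 are candidates."""
    q_o = -math.log(1.0 - u) / RATE  # quantile of X at level u
    x_o = math.floor(q_o)
    return x_o if cdf_y(x_o) >= u else x_o + 1

def quantile_y_bruteforce(u):
    """Smallest integer y >= 0 with F_Y(y) >= u, found by direct search."""
    y = 0
    while cdf_y(y) < u:
        y += 1
    return y

for u in (0.05, 0.5, 0.95):
    print(u, quantile_y(u), quantile_y_bruteforce(u))
```

The shortcut needs one evaluation of the continuous quantile function and one evaluation of $F_{Y}$, which is what makes inversion sampling from a balanced discrete distribution inexpensive.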

Appendix A.3. Proof of Proposition 2

Proof of Proposition 2.

Applying the law of iterated expectations [33] (Equation (2)) to the representation in Equation (5), we have $μ_{Y}(θ) = E_{X}[E_{U | X}[Y]]$. However, $E_{U | X}[Y] = E_{U | X}[Z + U] = Z + E_{U | X}[U]$ with $Z = ⌊X⌋$. Then, from $E_{U | X}[U] = R = X - Z$, we get $E_{U | X}[Y] = X$, which results in $μ_{Y}(θ) = E_{X}[X]$ and proves that $μ_{Y}(θ) = μ_{X}(θ)$. Using Equation (3) in [33], we have $σ_{Y}^{2}(θ) = \mathrm{Var}_{X}[E_{U | X}[Y]] + E_{X}[\mathrm{Var}_{U | X}[Y]]$. Equation (15) then follows from using $\mathrm{Var}_{X}[E_{U | X}[Y]] = \mathrm{Var}_{X}[X]$ and $\mathrm{Var}_{U | X}[Y] = R(1 - R)$. Moreover, $R$ satisfies $0 \leq R < 1$ and $\mathrm{Var}_{U | X}[Y] > 0$ so that $0 < R(1 - R) < 1/4$, but also $R(1 - R) < R \leq X$; hence, $0 < ζ_{0}(θ) < 1/4$ and $ζ_{0}(θ) < E[X]$, and $0 < ζ_{0}(θ) < \min \{E[X], 1/4\}$ follows. Using $R = X - Z$ gives $R(1 - R) = R - R^{2} = (2Z + 1) X - X^{2} - Z(Z + 1)$. Then, with $f_{X}(\cdot | θ)$ the pdf of $X$,

$$\begin{aligned} ζ_{0}(θ) &= E_{X}[R(1 - R)] \\ &= \int_{-\infty}^{\infty} [(2z + 1) x - x^{2} - z(z + 1)] f_{X}(x | θ) \, dx \quad \text{with } z = ⌊x⌋ \\ &= \sum_{z = -\infty}^{\infty} \int_{z}^{z + 1} [(2z + 1) x - x^{2} - z(z + 1)] f_{X}(x | θ) \, dx \\ &= \sum_{z = -\infty}^{\infty} \left\{ (2z + 1) \int_{z}^{z + 1} x f_{X}(x | θ) \, dx - \int_{z}^{z + 1} x^{2} f_{X}(x | θ) \, dx - z(z + 1) \int_{z}^{z + 1} f_{X}(x | θ) \, dx \right\}; \end{aligned}$$

hence, Equation (16) follows. □
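The two identities established in this proof (equal means, and variance equal to the latent variance plus $E[R(1 - R)]$) lend themselves to a quick Monte Carlo check. The simulation below is a sketch under assumptions that are not part of the article: the latent variable is drawn with Python's `random.gammavariate` using shape $aμ$ and scale $1/a$, and the function names, sample size, and seed are arbitrary illustrative choices.

```python
import math
import random

def balanced_round(x, rng):
    """Return floor(x), or floor(x) + 1 with probability x - floor(x)."""
    z = math.floor(x)
    r = x - z
    return z + (1 if rng.random() < r else 0)

def simulate(mu, a, n, seed=12345):
    """Draw n latent gamma values and their probabilistically rounded counts."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); mean = a * mu * (1 / a) = mu
    xs = [rng.gammavariate(a * mu, 1.0 / a) for _ in range(n)]
    ys = [balanced_round(x, rng) for x in xs]
    return xs, ys

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

xs, ys = simulate(mu=5.0, a=1.0, n=200_000)
# Sample estimate of E[R(1 - R)], the extra variance due to rounding
zeta0_hat = mean([(x - math.floor(x)) * (1.0 - (x - math.floor(x))) for x in xs])

print(mean(xs), mean(ys))                           # both close to mu
print(variance(ys), variance(xs) + zeta0_hat)       # close to each other
```

The estimated extra variance stays below 1/4, in line with the bound derived above.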

Appendix A.4. Proof of Proposition 3

Proof of Proposition 3.

Using Equation (5), $Y^{n}$ can be represented as $Y^{n} = (Z + U)^{n}$, which expands as $Y^{n} = \sum_{i = 0}^{n} \binom{n}{i} Z^{i} U^{n - i}$, giving $Y^{n} = Z^{n} + \sum_{i = 0}^{n - 1} \binom{n}{i} Z^{i} U$ since $U^{j} = U$ for $j \in \mathbb{N}_{+}$, and Equation (20) follows. Next, using the law of iterated expectations, we have $μ_{ZU}^{(i)}(θ) = E_{Z}[Z^{i} E_{U | Z}[U]]$. Since $Z = ⌊X⌋$, $Z = z$ is equivalent to $z \leq X < z + 1$, hence $E_{Z, U}[U | Z = z] = E_{X, U}[U | z \leq X < z + 1]$. However, we have by Equation (9) the identity $E_{X, U}[U | z \leq X < z + 1] = -z [F_{X}(z + 1 | θ) - F_{X}(z | θ)] + E_{X}(1, z | θ)$ so that we get the identity $z^{i} E_{Z, U}[U | Z = z] = -z^{i + 1} [F_{X}(z + 1 | θ) - F_{X}(z | θ)] + z^{i} E_{X}(1, z | θ)$. Summing the latter partial expectations for $z \in \mathbb{Z}$ yields $μ_{ZU}^{(i)}(θ) = -μ_{Z}^{(i + 1)}(θ) + \sum_{z = -\infty}^{\infty} z^{i} E_{X}(1, z | θ)$ since $F_{X}(z + 1 | θ) - F_{X}(z | θ)$ is the probability mass of the discrete concentration of $X$ (see Equation (1)). □

Appendix A.5. Proof of Proposition 4

Proof of Proposition 4.

Given $Y = y$, $U$ remains Bernoulli distributed. Moreover, $Y = y$ and $U = 1$ are equivalent to $Y = y$ and $y - 1 \leq X < y$. The success probability of $U$ given $Y = y$ is thus $p_{y} = [f_{Y}(y | θ)]^{-1} P(Y = y \text{ and } y - 1 \leq X < y)$ by Bayes’ rule. Note the identity $P(Y = y \text{ and } y - 1 \leq X < y) = E_{X}(1, y - 1 | θ) - (y - 1) [F_{X}(y | θ) - F_{X}(y - 1 | θ)]$, which follows by Equation (9) on using $y - 1$ instead of $y$. The expression of $p_{y}$ then follows as given in Equation (21). From Equation (5), the conditional density of $Y$ given $X = x$ and $U = u$ is $f_{Y | X, U}(y | X = x, U = u) = I_{(z, z + 1)}(x)$ with $z = y - u$. The likelihood (joint density and probability mass) of $X$, $U$, and $Y$ is thus:

$$\begin{aligned} f_{X, U, Y}(x, u, y) &= f_{U | X}(u | X = x) f_{X}(x | θ) I_{(y - u, y - u + 1)}(x) \\ &= (x - ⌊x⌋)^{u} (1 + ⌊x⌋ - x)^{1 - u} f_{X}(x | θ) I_{(y - u, y - u + 1)}(x) \\ &= (x - y + u)^{u} (1 + y - u - x)^{1 - u} f_{X}(x | θ) I_{(y - u, y - u + 1)}(x), \end{aligned}$$

where the first line follows by Bayes’ rule, and the last line follows on using $y = ⌊x⌋ + u$. Summing $f_{X, U, Y}(x, u, y)$ over $u \in \{0, 1\}$, we get:

$$f_{X, Y}(x, y) = (1 - y + x) f_{X}(x | θ) I_{(y - 1, y)}(x) + (1 + y - x) f_{X}(x | θ) I_{(y, y + 1)}(x).$$

The pdf in Equation (22) then follows by Bayes’ rule as $f_{X | Y}(x | Y = y, θ) = f_{X, Y}(x, y) / f_{Y}(y | θ)$. Finally, Equation (23) follows from a direct integration as:

$$\begin{aligned} E_{X | Y}[X^{n} | Y = y, θ] &= \frac{1}{f_{Y}(y | θ)} \left[ \int_{y - 1}^{y} x^{n} (1 - y + x) f_{X}(x | θ) \, dx + \int_{y}^{y + 1} x^{n} (1 + y - x) f_{X}(x | θ) \, dx \right] \\ &= \frac{1}{f_{Y}(y | θ)} \left[ (1 - y) \int_{y - 1}^{y} x^{n} f_{X}(x | θ) \, dx + \int_{y - 1}^{y} x^{n + 1} f_{X}(x | θ) \, dx \right. \\ &\quad \left. + (1 + y) \int_{y}^{y + 1} x^{n} f_{X}(x | θ) \, dx - \int_{y}^{y + 1} x^{n + 1} f_{X}(x | θ) \, dx \right]. \end{aligned}$$

□
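The conditional quantities derived in this proof can be evaluated on a simple case. The sketch below assumes, for illustration only, an exponential latent variable with a hypothetical rate of 0.3, so that $F_{X}$ and $E_{X}(1, y | θ)$ (read as the integral of $x f_{X}(x | θ)$ over $(y, y + 1)$) have closed forms. It computes the success probability $p_{y}$ from the two joint probabilities and the conditional mean of $X$ given $Y = y$ by a midpoint rule on the weighted density. All function names are illustrative.

```python
import math

RATE = 0.3  # hypothetical rate of the latent exponential variable X

def pdf_x(x):
    return RATE * math.exp(-RATE * x) if x >= 0.0 else 0.0

def interval_prob(y):
    """F_X(y + 1) - F_X(y) for an integer y."""
    if y < 0:
        return 0.0
    return math.exp(-RATE * y) - math.exp(-RATE * (y + 1))

def partial_mean(y):
    """E_X(1, y): integral of x f_X(x) over (y, y + 1), in closed form."""
    if y < 0:
        return 0.0
    def antiderivative(x):
        return -(x + 1.0 / RATE) * math.exp(-RATE * x)
    return antiderivative(y + 1) - antiderivative(y)

def joint_up(y):
    """P(Y = y and y - 1 <= X < y), i.e., X was rounded up to y."""
    return partial_mean(y - 1) - (y - 1) * interval_prob(y - 1)

def joint_down(y):
    """P(Y = y and y <= X < y + 1), i.e., X was rounded down to y."""
    return (y + 1) * interval_prob(y) - partial_mean(y)

def pmf_y(y):
    return joint_up(y) + joint_down(y)

def p_up_given_y(y):
    """Success probability of U given Y = y."""
    return joint_up(y) / pmf_y(y)

def cond_mean_x_given_y(y, n=1000):
    """E[X | Y = y] by a midpoint rule on the weighted density."""
    h = 1.0 / n
    num = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        num += (y - 1.0 + t) * t * pdf_x(y - 1.0 + t) * h   # x in (y - 1, y)
        num += (y + t) * (1.0 - t) * pdf_x(y + t) * h       # x in (y, y + 1)
    return num / pmf_y(y)

for y in range(0, 4):
    print(y, pmf_y(y), p_up_given_y(y), cond_mean_x_given_y(y))
```

For a nonnegative latent variable, $p_{0} = 0$ because a count of zero can only arise from rounding down. Averaging the conditional means over the distribution of $Y$ recovers the mean of $X$, which gives a simple consistency check.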

References

1. Hilbe, J.M. Negative Binomial Regression; Cambridge University Press: Cambridge, UK, 2011; p. 573.
2. Sellers, K.F.; Shmueli, G. Data dispersion: Now you see it ⋯ now you don’t. Commun. Stat. Theory Methods 2013, 42, 3134–3147.
3. Hagmark, P.E. On construction and simulation of count data models. Math. Comput. Simul. 2008, 77, 72–80.
4. Wedderburn, R.W. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 1974, 61, 439–447.
5. Consul, P.; Famoye, F. Generalized Poisson regression model. Commun. Stat. Theory Methods 1992, 21, 89–109.
6. Bonat, W.H.; Jørgensen, B.; Kokonendji, C.C.; Hinde, J.; Demétrio, C.G. Extended Poisson–Tweedie: Properties and regression models for count data. Stat. Model. 2018, 18, 24–49.
7. Conway, R.W.; Maxwell, W.L. A queuing model with state dependent service rates. J. Ind. Eng. 1962, 12, 132–136.
8. Efron, B. Double exponential families and their use in generalized linear regression. J. Am. Stat. Assoc. 1986, 81, 709–721.
9. Zou, Y.; Geedipally, S.R.; Lord, D. Evaluating the double Poisson generalized linear model. Accid. Anal. Prev. 2013, 59, 497–505.
10. Winkelmann, R. A count data model for gamma waiting times. Stat. Pap. 1996, 37, 177–187.
11. Cameron, A.C.; Johansson, P. Count data regression using series expansions: With applications. J. Appl. Econom. 1997, 12, 203–223.
12. Klakattawi, H.; Vinciotti, V.; Yu, K. A simple and adaptive dispersion regression model for count data. Entropy 2018, 20, 142.
13. Zeviani, W.M.; Ribeiro, P.J., Jr.; Bonat, W.H.; Shimakura, S.E.; Muniz, J.A. The Gamma-count distribution in the analysis of experimental underdispersed data. J. Appl. Stat. 2014, 41, 2616–2626.
14. Chakraborty, S. Generating discrete analogues of continuous probability distributions-A survey of methods and constructions. J. Stat. Distrib. Appl. 2015, 2, 30.
15. Plan, E.L. Modeling and simulation of count data. CPT Pharmacomet. Syst. Pharmacol. 2014, 3, 1–12.
16. Veraart, A.E. Modeling, simulation and inference for multivariate time series of counts using trawl processes. J. Multivar. Anal. 2019, 169, 110–129.
17. Roy, D.; Dasgupta, T.A. A discretizing approach for evaluating reliability of complex systems under stress-strength model. IEEE Trans. Reliab. 2001, 50, 145–150.
18. Hagmark, P.E. A new concept for count distributions. Stat. Probab. Lett. 2009, 79, 1120–1124.
19. Hagmark, P.E. An Exceptional Generalization of the Poisson Distribution. Open J. Stat. 2012, 2, 313–318.
20. Chakraborty, S. A New Discrete Distribution Related to Generalized Gamma Distribution and Its Properties. Commun. Stat. Theory Methods 2013, 44, 1691–1705.
21. Holland, B.S. Some Results on the discretization of continuous probability distributions. Technometrics 1975, 17, 333–339.
22. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–38.
23. Lawless, J.F. Statistical Models and Methods for Lifetime Data, 2nd ed.; Wiley Series in Probability and Statistics; Wiley-Interscience: New York, NY, USA, 2002.
24. Chakraborty, S.; Chakravarty, D. Discrete Gamma Distributions: Properties and Parameter Estimations. Commun. Stat. Theory Methods 2012, 41, 3301–3324.
25. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020.
26. MATLAB. Version 9.0.0 (R2016a); The MathWorks Inc.: Natick, MA, USA, 2016.
27. Roy, D. Reliability measures in the discrete bivariate set-up and related characterization results for a bivariate geometric distribution. J. Multivar. Anal. 1993, 46, 362–373.
28. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
29. Zhang, X.; Delpha, C.; Diallo, D. Performance of Jensen Shannon Divergence in Incipient Fault Detection and Estimation. In Proceedings of the ICASSP 2019, 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2742–2746.
30. Daganzo, C.F.; Gayah, V.V.; Gonzales, E.J. The potential of parsimonious models for understanding large scale transportation systems and answering big picture questions. EURO J. Transp. Logist. 2012, 1, 47–65.
31. Roy, D.; Ghosh, T. A new discretization approach with application in reliability estimation. IEEE Trans. Reliab. 2009, 58, 456–461.
32. Bracquemond, C.; Gaudoin, O. A survey on discrete lifetime distributions. Int. J. Reliab. Qual. Saf. Eng. 2003, 10, 69–98.
33. Kaehler, J. Laws of iterated expectations for higher order central moments. Stat. Pap. 1990, 31, 295–299.

Figure 1. Probability mass plots for the balanced discrete gamma distributions with mean values $μ = 2.5$ (left panel) and $μ = 5$ (right panel) and scales $a$ selected to yield an index of dispersion (ID, variance-to-mean ratio) of ID = 4 (bottom row), ID = 1 (central row), and ID = 0.5 (top row).


Figure 2. Box plots for the balanced discrete gamma distributions with mean values $μ = 2.5$ and $μ = 5$ and an index of dispersion (ID, variance-to-mean ratio) of ID = 4, ID = 1, and ID = 0.5. The thick vertical bar inside the interquartile range (i.e., the rectangular box whose left and right sides are the 25% and 75% quartiles, respectively) is the median of the distribution.


Figure 3. Index of dispersion (ID, variance-to-mean ratio) of the balanced discrete gamma distribution against the mean value ($μ$) for selected scale parameter ($a$) values in the range $[1, 1000]$ mostly corresponding to equidispersion and underdispersion ($0 < ID \leq 1$).


Figure 4. Comparison of the cumulative distribution functions of the balanced discrete gamma and discrete concentration of gamma distributions based on a continuous gamma distribution with scale $a = 1$ and shape $b = 5$.


Figure 5. Jensen–Shannon divergence (JSD in bit) measured between the balanced discrete gamma (BDG) distribution and the corresponding discrete concentration (A) and between the Poisson and the BDG distributions under both latent equidispersion (one-parameter BDG distribution) and marginal equidispersion (unit variance-to-mean ratio) restrictions (B), against the mean value ($μ$).


Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

