1. Introduction
For any $x_0 \in \mathbb{R}$ and $h > 0$, and for any sequence $\{p_k\}_{k \in \mathbb{N}_0}$ such that $p_k \geq 0$ for any $k$ and $\sum_{k=0}^{\infty} p_k = 1$, a random variable $X$ with support $\{x_0 + kh : k \in \mathbb{N}_0\}$ and probability mass function $P(X = x_0 + kh) = p_k$ is a discrete lattice random variable. If $x_0 = 0$ and $h = 1$, so that the support is $\mathbb{N}_0$, $X$ is a count random variable.
In most cases, the probability mass function
is not interesting, since it is difficult to deal with and there is no clear interpretation of the pattern of randomness it describes. The craft of probabilistic modelling (Gani (1986) [
1] uses a diversity of criteria to describe and select models, namely, those arising from randomness patterns (such as counts in Bernoulli trials, sampling with or without replacement, and random draws from urns). Another source for the rational description of count models is characterisation theorems based on structural properties (e.g., a power series distribution with mean = variance, or maximum Shannon entropy with prescribed arithmetic and/or geometric mean). Recurrence relationships (for instance,
) or mathematical properties (for instance, the variance being at most a quadratic function of the expectation) also define interesting families of discrete random variables. On the other hand, asymptotic properties such as arithmetic properties, namely, infinite divisibility, discrete self-decomposability, and stability, serve as guidance in model choice.
Section 2 describes the discrete uniform random variables, modelling equiprobability patterns resulting from the principle of insufficient reason, of which the Bernoulli random variable with parameter
is the simplest example.
Section 3 is a detailed overview of count models—Binomial, Negative Binomial—arising from the observation of random patterns in Bernoulli trials (including the Poisson random variable, as a limit of Binomial random variables under a mean stability restriction, and the Hypergeometric sampling without replacement model, herein in the context of conditioning on the sum of two independent random variables). In
Section 4, the recurrence relation holding for the probability mass function of Binomial, Poisson, and Negative Binomial random variables investigated by Katz [
2] and by Panjer [
3] is extended to describe Hess et al. [
4]’s family of basic count distributions.
Section 5 briefly discusses alternative organisations of count models, namely, via Power Series distributions or Kemp’s [
5] generalised hypergeometric probability distributions.
Section 6 contrasts the equilibrium pattern of Zipf’s [
6] law with the equiprobability modelled by the discrete uniform random variables.
Section 7 and
Section 8 discuss ways of transforming random models, respectively, by randomising parameters and via the discretisation of continuous random variables.
Section 9 briefly discusses the role of characterisations in the craft of modelling count data. Further issues are briefly described in
Section 10.
2. Bernoulli Random Variables, Principle of Insufficient Reason and Discrete Uniform Random Variables
Let
be the probability of the event
occurring as the outcome of an experiment. Either
(sometimes referred to as success) occurs once, or its complementary event
(referred to as failure) occurs, so the number of occurrences of
in a single trial is either 1, with probability
p, or 0, with probability
. We shall use the notation
where the first line indicates the support of the count random variable
B, and the second line the probabilities of the support points.
The above random variable is called “Bernoulli”, with parameter
p, in honour of the brilliant probabilist Jacques Bernoulli, author of the fascinating
Ars Conjectandi [
7], published posthumously in 1713 by his nephew Nicolaus Bernoulli (also a probabilist).
We shall use the notation $B \sim \mathrm{Bernoulli}(p)$; if $p = \frac{1}{2}$, meaning that $A$ and $\bar{A}$ are equiprobable, it should be assumed that there is insufficient reason to assign different probabilities to $A$ and $\bar{A}$.
The principle of insufficient reason (renamed principle of indifference by Maynard Keynes [
8]) was used in the foundation texts of Jacques Bernoulli [
7] and Laplace [
9] for assigning epistemic probabilities to equiprobable events. The natural extension of the Bernoulli$\left(\frac{1}{2}\right)$ model is the equiprobable count model $P(X = k) = \frac{1}{n}$, $k = 1, \ldots, n$, named Discrete Uniform with parameter $n$, that we denote as $X \sim \mathrm{Uniform}\{1, \ldots, n\}$.
3. Count Random Variables in Bernoulli Trials
Let $\{E_k\}_{k \geq 1}$ be a sequence of independent random experiments (trials) whose outcomes are either $A$ (success), with probability $p$, or $\bar{A}$ (failure), with probability $1 - p$. The assumption of independence means that the outcome of trial $E_k$ has no influence whatsoever on the outcome of any other trial.
In this setting, the counts that matter are the following:
The number $k$ of outcomes $A$ in $n$ trials ($n$ fixed, $k$ random). From the definition of Bernoulli trials, it is easy to conclude that, with $X$ denoting such a count random variable, $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, $k = 0, 1, \ldots, n$. This random variable is called a binomial random variable with parameters $n$ and $p$, and we shall denote it as $X \sim \mathrm{Binomial}(n, p)$. The expectation is $E[X] = np$ and the variance $\mathrm{Var}[X] = np(1-p)$. Hence, the dispersion index is $\frac{\mathrm{Var}[X]}{E[X]} = 1 - p < 1$. For that reason, we say that the binomial random variable is underdispersed.
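As a minimal numerical sketch (function names and parameter values are ours, for illustration only), the underdispersion of the Binomial model can be checked directly:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
mean = n * p                  # E[X] = np
variance = n * p * (1 - p)    # Var[X] = np(1 - p)
dispersion = variance / mean  # = 1 - p = 0.7 < 1: underdispersed
```

The dispersion index depends only on $p$, not on $n$, so every non-degenerate Binomial law is underdispersed.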
The number $n$ of trials needed to observe $k$ times the outcome $A$ ($k$ fixed, $n$ random). The simple case is $k = 1$. In this case, with $X_1$ denoting such a count random variable, $P(X_1 = n) = p(1-p)^{n-1}$, $n = 1, 2, \ldots$; $X_1$ is called a geometric (or sometimes Pascal) random variable, and we shall use the notation $X_1 \sim \mathrm{Geometric}(p)$. Its expectation is $E[X_1] = \frac{1}{p}$ and its variance is $\mathrm{Var}[X_1] = \frac{1-p}{p^2}$.
More generally, let $X_k$ be the number of trials needed to observe the $k$-th occurrence of $A$. Obviously, due to the independence of the Bernoulli trials, $X_k$ is a Negative Binomial random variable with parameters $k$ and $p$, that we denote $X_k \sim \mathrm{NegativeBinomial}(k, p)$, with $P(X_k = n) = \binom{n-1}{k-1} p^k (1-p)^{n-k}$, $n = k, k+1, \ldots$. Obviously, $X_k = \sum_{i=1}^{k} G_i$, with the $G_i \sim \mathrm{Geometric}(p)$ independent. So $E[X_k] = \frac{k}{p}$ and $\mathrm{Var}[X_k] = \frac{k(1-p)}{p^2}$.
It is sometimes convenient to shift the Negative Binomial random variables to start at 0, i.e., to count the number of $\bar{A}$s that precede the $k$-th $A$. In other words, to define $Y_k = X_k - k$, with $P(Y_k = j) = \binom{j+k-1}{j} p^k (1-p)^j$, $j = 0, 1, \ldots$. Its dispersion index is $\frac{\mathrm{Var}[Y_k]}{E[Y_k]} = \frac{1}{p} > 1$, and for that reason the Negative Binomial random variables are considered overdispersed.
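A companion sketch (again with illustrative names and values) for the shifted Negative Binomial confirms the overdispersion:

```python
from math import comb

def negbin_pmf(j, k, p):
    """P(Y = j): number of failures preceding the k-th success, j = 0, 1, ..."""
    return comb(j + k - 1, j) * p**k * (1 - p)**j

k, p = 3, 0.4
mean = k * (1 - p) / p          # E[Y] = k(1-p)/p
variance = k * (1 - p) / p**2   # Var[Y] = k(1-p)/p^2
dispersion = variance / mean    # = 1/p = 2.5 > 1: overdispersed
```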
Note also that, using the gamma function extension of factorials, $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt$, $z > 0$, satisfying the recurrence relation $\Gamma(z+1) = z\,\Gamma(z)$, so that $\Gamma(n+1) = n!$ for integer $n$, we may consider Negative Binomial random variables with real parameter $k > 0$: $P(Y_k = j) = \frac{\Gamma(k+j)}{\Gamma(k)\, j!}\, p^k (1-p)^j$, $j = 0, 1, \ldots$
In many situations, asymptotic results are paramount in modelling decisions or simplifications. The first central limit theorem (a name coined by Pólya [10] in 1920), establishing that if $X \sim \mathrm{Binomial}(n, p)$ and $n$ is large, then the distribution of $\frac{X - np}{\sqrt{np(1-p)}}$ can be approximated by the standard Gaussian distribution, appeared in the second edition of Abraham de Moivre’s [11] The Doctrine of Chances.
The other important asymptotic result about Binomial sequences is Poisson’s [
12] law of rare events:
Let $X_n \sim \mathrm{Binomial}(n, p_n)$, mean-stable in the sense that $n p_n \to \lambda > 0$ (observe that this implies that $p_n \to 0$, and this is the rationale for the name “law of rare events”). Then, $X_n$ converges in distribution to $Y$, where $Y$ is a Poisson random variable with parameter $\lambda$, that we denote $Y \sim \mathrm{Poisson}(\lambda)$, with $P(Y = k) = e^{-\lambda} \frac{\lambda^k}{k!}$, $k = 0, 1, \ldots$
$E[Y] = \mathrm{Var}[Y] = \lambda$, and hence, in what concerns dispersion, the Poisson random variable is a yardstick, in the sense that its dispersion index is exactly 1.
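The mean-stable Binomial-to-Poisson convergence can be verified numerically; the following fragment (parameter values are ours, chosen only for illustration) compares the two probability mass functions:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam, n = 2.0, 10_000
# mean-stable sequence: Binomial(n, lam/n) should be close to Poisson(lam)
err = max(abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for k in range(20))
```

The approximation error decreases at rate roughly $\lambda^2 / n$, consistent with the law of rare events.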
McCabe and Skeels (2020) [
13] and Di Noia et al. (2024) [
14] extensively investigated testing
Poissonness vs. overdispersion or underdispersion; cf. also Mijburgh and Visagie (2020) [
15]’s overview of goodness-of-fit tests for the Poisson distribution.
The Binomial, the Negative Binomial, and the Poisson random variables are the discrete members of Morris’ [
16] Natural Exponential Family (NEF) with quadratic variance function in the mean value (QVF). (Recall that
$X$ is a member of a NEF if its probability density function can be written as $f(x \mid \theta) = h(x)\, e^{\theta x - \psi(\theta)}$, and therefore its cumulant generating function has the simple form $K_X(t) = \psi(\theta + t) - \psi(\theta)$.) In the sequel, Morris [
17] treated “
topics that can be handled within this unified NEF-QVF formulation, including unbiased estimation, Bhattacharyya and Cramér-Rao lower bounds, conditional distributions and moments, quadratic regression, conjugate prior distributions, moments of conjugate priors and posterior distributions, empirical Bayes and minimax, marginal distributions and their moments, parametric empirical Bayes, and characterisations”, and this shows the relevance of Binomial, Negative Binomial, and Poisson count models in Statistical Inference.
It is also worth mentioning that Binomial, Negative Binomial, and Poisson random variables have relevant additive properties, in the sense that
If $X_1 \sim \mathrm{Binomial}(n_1, p)$ and $X_2 \sim \mathrm{Binomial}(n_2, p)$ are independent, then $X_1 + X_2 \sim \mathrm{Binomial}(n_1 + n_2, p)$.
If $X_1 \sim \mathrm{NegativeBinomial}(k_1, p)$ and $X_2 \sim \mathrm{NegativeBinomial}(k_2, p)$ are independent, then $X_1 + X_2 \sim \mathrm{NegativeBinomial}(k_1 + k_2, p)$.
If $X_1 \sim \mathrm{Poisson}(\lambda_1)$ and $X_2 \sim \mathrm{Poisson}(\lambda_2)$ are independent, then $X_1 + X_2 \sim \mathrm{Poisson}(\lambda_1 + \lambda_2)$.
Furthermore, if $X \sim \mathrm{Poisson}(\lambda)$ is subject to Binomial filtering or thinning with retention probability $p$, i.e., $Y = \sum_{i=1}^{X} B_i$ with the $B_i \sim \mathrm{Bernoulli}(p)$ independent, the resulting $Y$ has probability mass function $P(Y = k) = e^{-\lambda p} \frac{(\lambda p)^k}{k!}$, i.e., $Y \sim \mathrm{Poisson}(\lambda p)$.
Many natural phenomena are subject to filtering; for instance, using a Poisson model for the number of eggs laid by turtles, the number of hatched eggs, the number of surviving newborns until they reach the ocean, the number of turtles surviving the first year, and so forth, can be modelled by a chain of filtered Poisson random variables, a very useful tool in population dynamics modelling.
On the other hand, the Negative Binomial is a “random Poisson”, obtained when its parameter is a random variable with a Gamma distribution: let $X \mid \Lambda = \lambda \sim \mathrm{Poisson}(\lambda)$ with $\Lambda \sim \mathrm{Gamma}\left(k, \frac{p}{1-p}\right)$ (shape and rate, respectively). Then, $P(X = j) = \binom{j+k-1}{j} p^k (1-p)^j$, $j = 0, 1, \ldots$, so we obtain $X \sim \mathrm{NegativeBinomial}(k, p)$, in the form shifted to start at 0. For this reason, and as the Negative Binomial random variable is overdispersed, the Poisson being the yardstick, some authors in population dynamics modelling consider the Negative Binomial a more dispersed Poisson.
The above results on the sums of Morris’ discrete variable have interesting consequences when conditioning is applied:
If $X_1 \sim \mathrm{Poisson}(\lambda_1)$ and $X_2 \sim \mathrm{Poisson}(\lambda_2)$ are independent, $P(X_1 = k \mid X_1 + X_2 = n) = \binom{n}{k} \left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^k \left(\frac{\lambda_2}{\lambda_1 + \lambda_2}\right)^{n-k}$, i.e., $X_1 \mid X_1 + X_2 = n \sim \mathrm{Binomial}\left(n, \frac{\lambda_1}{\lambda_1 + \lambda_2}\right)$.
Reasoning in the same way, if $X_1 \sim \mathrm{Binomial}(n_1, p)$ and $X_2 \sim \mathrm{Binomial}(n_2, p)$ are independent, $P(X_1 = k \mid X_1 + X_2 = n) = \frac{\binom{n_1}{k}\binom{n_2}{n-k}}{\binom{n_1+n_2}{n}}$.
This is the probability mass function of a Hypergeometric random variable with parameters $n_1 + n_2$, $n_1$, and $n$.
If $X \sim \mathrm{Hypergeometric}(N, M, n)$, $E[X] = n\frac{M}{N}$ and $\mathrm{Var}[X] = n\frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N-n}{N-1}$.
It is interesting to observe that if $X \sim \mathrm{Binomial}(n, p)$, $Y \sim \mathrm{Poisson}(\lambda)$, and $Z \sim \mathrm{NegativeBinomial}(k, \theta)$ (shifted to start at 0) have the same expectation $\mu$, then $\mathrm{Var}[X] < \mathrm{Var}[Y] < \mathrm{Var}[Z]$.
In other words, for these three count models, the number of parameters implies a tradeoff between decreasing simplicity and increasing information. Augmenting the number of parameters supplies more information. However, one should always bear in mind that all models are wrong, but some are useful (as Box [
18] judiciously stated), and that the parsimony principle hints that the simplest model that is useful enough should be used. Observe that in model choice, criteria such as the Akaike Information Criterion (AIC) [
19] penalise the number of parameters of the model, and this often leads us to choose a model with fewer parameters, even when a model with more parameters provides a slightly better fit.
The Hypergeometric random variable is used in simple random sampling without replacement, while the Binomial is appropriate for simple random sampling with replacement. In the Hypergeometric context, we are dealing with exchangeability; in the Binomial setting, with a stronger independence concept.
The Hypergeometric random model is also the basis for estimating the size of a population with the technique of capture–recapture, cf. Seber and Schofield [
20]. Suppose that in a controlled investigation, $k$ animals are captured, marked, and released, and that after a while there is a second instance of capturing, in which $j$ of the $n$ captured are marked (i.e., recaptured). If the unknown size of the population is $N$, $k$ of which have been marked and $N - k$ are unmarked, $j$ is the observed value of $X \sim \mathrm{Hypergeometric}(N, k, n)$, with $E[X] = n\frac{k}{N}$. The Petersen estimator $\hat{N} = \frac{kn}{j}$ of the size of the population is a method of moments estimator (observe that we described a technique with just one recapture count of $j$ marked animals, and in a sample of size one, this is the mean).
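A minimal sketch of the method-of-moments reasoning behind the Petersen estimator (the counts below are illustrative, not data from the reference):

```python
def petersen_estimate(k, n, j):
    """Method-of-moments estimate of N: since E[X] = n * k / N for the
    Hypergeometric recapture count X, matching j to this mean gives N_hat = k * n / j."""
    return k * n / j

# 100 animals marked, second sample of 80 contains 16 recaptures
N_hat = petersen_estimate(k=100, n=80, j=16)
```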
4. Katz Count Models and Extensions
The probability mass functions of the discrete Morris NEF-QVF Binomial, Poisson, and Negative Binomial random variables share the interesting property
$p_k = \left(a + \frac{b}{k}\right) p_{k-1}, \quad k = 1, 2, \ldots \quad (1)$
Consider the non-degenerate count random variables whose probability mass function satisfies the recursive relation (1).
Multiplying both sides of (1) by $s^k$ and adding for $k \geq 1$, we find that the corresponding probability generating functions $G_X(s) = \sum_k p_k s^k$ are the solutions of the differential equation $(1 - as)\, G_X'(s) = (a + b)\, G_X(s)$; for details, cf. Pestana and Velosa (2004) [21]; namely:
Binomial$(n, p)$: $a = -\frac{p}{1-p}$, $b = (n+1)\frac{p}{1-p}$, i.e., $G_X(s) = (1 - p + ps)^n$.
Poisson$(\lambda)$: $a = 0$, $b = \lambda$, i.e., $G_X(s) = e^{\lambda(s-1)}$.
NegativeBinomial$(k, p)$ (shifted to start at 0): $a = 1 - p$, $b = (k-1)(1-p)$, i.e., $G_X(s) = \left(\frac{p}{1 - (1-p)s}\right)^k$.
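The recursion (1) gives a simple generator for these probability mass functions; the $(a, b)$ pairs below follow the standard Katz parameterisation (function names are ours):

```python
from math import comb, exp, factorial

def katz_pmf(a, b, p0, kmax):
    """Probabilities p_0, ..., p_kmax from the recursion p_k = (a + b/k) p_{k-1}."""
    p = [p0]
    for k in range(1, kmax + 1):
        p.append((a + b / k) * p[-1])
    return p

# Poisson(lam): a = 0, b = lam, p_0 = exp(-lam)
lam = 1.5
pois = katz_pmf(0.0, lam, exp(-lam), 12)

# Binomial(n, p): a = -p/(1-p), b = (n+1)p/(1-p), p_0 = (1-p)^n
n, p = 8, 0.25
binom = katz_pmf(-p / (1 - p), (n + 1) * p / (1 - p), (1 - p)**n, n)
```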
The simple expression (
1) for the successive probability atoms has been rediscovered several times in different contexts (McCabe and Skeels, 2020 [
13]). Katz (1945, 1965) [
2,
22] used it to organise a family of discrete models in the same spirit as the continuous Pearson family, and for that reason, the family of random variables
whose probability mass function satisfies (
1) is referred to as the Katz family. In Risk Theory and Insurance,
is known as Panjer’s class, due to Panjer’s (1981) [
3] important breakthrough that the recurrence relation (
1) implies that the distribution of aggregate claims can be iteratively computed or approximated:
Let $X_1, X_2, \ldots$ denote the identically distributed discrete claim sizes, with $f_k = P(X_i = k)$, and assume that the $X_i$ are mutually independent and independent of the claim number $N$. Then, $P(S = s) = \sum_{n=0}^{\infty} P(N = n)\, f^{*n}_s$, where $f^{*n}$ denotes the $n$-th convolution of $f$. Consequently, the probabilities $g_s = P(S = s)$ of the aggregate claim amount $S = \sum_{i=1}^{N} X_i$ (understood as $S = 0$ if $N = 0$) can be obtained recursively using Panjer’s algorithm: when $f_0 = 0$, $g_0 = P(N = 0)$ and
$g_s = \frac{1}{1 - a f_0} \sum_{j=1}^{s} \left(a + \frac{b\, j}{s}\right) f_j\, g_{s-j}, \quad s = 1, 2, \ldots$
This requires only $O(s^2)$ computations to obtain $g_0, \ldots, g_s$, while the traditional convolution method would require $O(s^3)$.
. More generally, the cumulative distribution function
of the aggregate claim
for an arbitrary distribution function
of the individual claim sizes, when the number of claims is
with probability mass
, satisfies the integral equation
for
, if
. For nonnegative claim sizes with
,
For detailed proofs, cf. Rolski et al. [
23] (pp. 118–124). Klugman et al. [
24] (pp. 221–224) use an example to discuss in depth the use of the empirical dispersion index as guidance for model choice in the Panjer family.
If we relax the condition on the support, requiring the recursion (1) to hold only from the second support point onwards, as investigated by Jewell [25], Willmot [26], or Sundt [27], we also obtain the Logarithmic random variable, with probability mass function $P(X = k) = \frac{-\theta^k}{k \ln(1-\theta)}$, $k = 1, 2, \ldots$, $\theta \in (0, 1)$, used by Fisher [28] to model species abundance, and the Engen Extended Negative Binomial (ENB) [29] random variable.
The dispersion indices of the Logarithmic and of the Engen random variables depend on $\theta$.
Hess et al. (2002) [4] investigated $k$-Panjer classes, $k \in \mathbb{N}_0$, whose probability mass function satisfies the recursion (1) for indices beyond the left endpoint $k$ of the support. Excluding the 0-Panjer family, which does not contain the logarithmic and ENB distributions, Hess et al. (2002) [4] established that any $k$-Panjer distribution is the left endpoint truncation of a $(k-1)$-Panjer distribution, for $k \geq 1$. For this reason, they call the Binomial, Poisson, Negative Binomial, Logarithmic, and ENB distributions basic count models.
Klugman et al.’s [
24] notation
has been used by Fackler (2024) [
30] to supply a unified, practical, and intuitive representation of the Panjer distributions and their parameter space, and give an inventory of parameterisations used for Panjer distributions.
Panjer’s recursion has been extended by Tzaninis and Bozikas (2024) [
31] to mixed compound distributions, following the use of finite mixtures for modelling dispersion in count data by Ong et al. (2023) [
32].
6. Uniformity and Power Law Randomness, a Striking Example of Opposite Patterns
The simplest equiprobability pattern of count randomness is modelled by the discrete uniform distribution, $P(X = k) = \frac{1}{n}$, $k = 1, \ldots, n$.
Equiprobability is, however, the exception; natural phenomena are more prone to present other patterns of equilibrium, such as with Zipf’s model, which is the special case $q = 0$ of the Zipf–Mandelbrot family of discrete models given by
$P(X = k) = \frac{(k + q)^{-\rho}}{H_{n, q, \rho}}, \quad k = 1, \ldots, n,$
where $\rho > 0$, $k$ is the rank of the data and $H_{n, q, \rho} = \sum_{i=1}^{n} (i + q)^{-\rho}$ is a generalisation of the harmonic number $H_n = \sum_{i=1}^{n} \frac{1}{i}$. The cumulative distribution function is, for $1 \leq x \leq n$, $F(x) = \frac{H_{\lfloor x \rfloor, q, \rho}}{H_{n, q, \rho}}$, where $\lfloor x \rfloor$ is the largest integer not greater than $x$.
Zipf’s law has been introduced to model the equilibrium—in the sense that rank × frequency is approximately constant—in verbal communication, which employs mainly words of ordinary usage (social trend) and scarcely words tied to the user’s vocabulary and thematic preferences (individual trend). In what concerns extra-pair paternity (EPP), there is some evidence that the observed variability of the number of extra-pair offspring (EPO) in the brood, in various species of passerines, results from a delicate equilibrium of needing a social partner to care for raising the brood and the tendency to have offspring sired by a stronger male. This is the reason why the Zipf–Mandelbrot family seems to be a plausible model, cf. Marques et al. (2005) [
35].
Notice that Zipf’s law is also a particular case of the extended parameter space of the right-truncated logarithmic distribution, where $P(X = k) \propto \frac{\theta^k}{k}$ for $k = 1, \ldots, n$. In fact, when considering the truncated logarithmic model, the parameter space can be extended from $\theta \in (0, 1)$ to $\theta \in (0, 1]$, as already observed. In particular, if $\theta = 1$, we obtain Zipf’s law. For more information, cf., e.g., Johnson et al. (2005) [33] (Chapter 11) and references therein. It is a power law in the sense that $P(X = k) \propto k^{-\rho}$, for some positive $\rho$, and therefore its log–log plot exhibits a linear signature. When an almost linear signature is observed, it is considered strong support for choosing a power law model, namely, in the discrete context, a Zipf–Mandelbrot law.
9. Characterisations of Count Models
Families of discrete distributions have striking structural properties, with relevant consequences for model choice. Characterisation theorems are also helpful in the craft of probabilistic modelling.
As already observed, the dispersion index of the Poisson distribution is 1. In fact, a PSD random variable $X$ has $\frac{\mathrm{Var}[X]}{E[X]} = 1$ if and only if $X \sim \mathrm{Poisson}(\lambda)$, for some $\lambda > 0$, a characterisation due to Kosambi (1949) [39].
Characterisations are a useful tool in model choice. For instance, Kapur (1989) [
40] has shown that a discrete random variable $X$ with fixed arithmetic mean has maximum Shannon entropy $H(X) = -\sum_k p_k \ln p_k$ if and only if $X$ is a Geometric random variable. Observe that choosing the largest entropy model within a class of distributions amounts to selecting as default the least informative model, and so minimising the prior information.
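The Lagrange-multiplier argument behind this characterisation can be sketched as follows (a standard maximum entropy computation, not the reference's own derivation):

```latex
\max_{\{p_k\}} \; -\sum_{k=0}^{\infty} p_k \ln p_k
\quad \text{subject to} \quad
\sum_{k=0}^{\infty} p_k = 1, \qquad \sum_{k=0}^{\infty} k\, p_k = \mu .
```

Stationarity of the Lagrangian gives $-\ln p_k - 1 - \alpha - \beta k = 0$, so $p_k \propto e^{-\beta k}$, i.e., $p_k = (1 - \theta)\,\theta^k$ with $\theta = e^{-\beta}$: a Geometric law on $\{0, 1, \ldots\}$, with $\theta$ determined by $\mu = \frac{\theta}{1 - \theta}$.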
A discrete random variable $X$ with fixed geometric mean has maximum Shannon entropy if and only if it is a Zipf or Discrete Pareto random variable, $P(X = k) = \frac{k^{-\rho}}{\zeta(\rho)}$, $k = 1, 2, \ldots$, $\rho > 1$, where $\zeta(\rho) = \sum_{k=1}^{\infty} k^{-\rho}$ is the Riemann zeta function. A discrete random variable $X$ with fixed arithmetic and geometric means has maximum Shannon entropy if and only if $X$ is a Good random variable, $P(X = k) = \frac{\theta^k k^{-\rho}}{\theta\,\Phi(\theta, \rho, 1)}$, $k = 1, 2, \ldots$, where $\Phi(z, s, a) = \sum_{j=0}^{\infty} \frac{z^j}{(j + a)^s}$ is the Lerch function (subsubsection 9.55 in Gradshteyn and Ryzhik (2007) [37]). Observe that the Geometric random variable with support $\{1, 2, \ldots\}$ is the special case when $\rho = 0$, and the Logarithmic random variable is the case $\rho = 1$.
10. Concluding Remarks
As in the continuous case, asymptotic results also play an important role in modelling; for instance, the Poisson limit in discrete settings is, to a certain extent, comparable to the central limit Gaussian law in general; cf. Steutel and van Harn (1979) [
41]. In the same spirit, extremal discrete laws do arise naturally in order statistics contexts; cf. Hall (1996) [
42]. In addition, useful models can originate other patterns of randomness via truncation (with the eventual broadening of the parameter space), randomly stopped sums, randomly stopped extremes, or the randomisation of parameters.
Moreover, among all continuous probability distributions with support $(0, \infty)$ and mean $\frac{1}{\lambda}$, the exponential distribution with probability density function $f(x) = \lambda e^{-\lambda x}$, $x > 0$, modelling interarrival times in the Poisson process, has the largest differential entropy, $h(X) = 1 - \ln \lambda$, meaning that the Poisson process has the greatest entropy among all the homogeneous point processes with the same given intensity $\lambda$. For that reason, Poisson streams of events are generally interpreted as representing the pattern of unconstrained randomness. This is a rationale for considering Poisson modelling, since entropy is non-decreasing, and therefore the Poisson random variable is a weak limit in very general frameworks, in particular of Binomial and of Negative Binomial sequences under a mean stability condition. In a sense, the Poisson distribution models unconstrained scattering, of the type of throwing rice grains when seeding.
In Population Dynamics, dispersion is a clue to model choice. Zhang et al. (2018) [
43] investigated dispersion patterns of finite mixtures, and Cahoy et al. (2021) [
44] introduced flexible fractional Poisson distributions to model underdispersed and overdispersed count data. Recent advances in what concerns underdispersed count models were made by Huang (2022) [
45] and Seck et al. (2022) [
46]. Rana et al. (2023) [
47] investigated the influence of outliers on overdispersion, and Sengupta and Roy (2023) [
48] the role of under-reported count data and zero-inflated models, an issue also investigated by Aswi et al. (2022) [
49].
From the additive properties of Binomial random variables shown in
Section 3, $X \sim \mathrm{Binomial}(n, p)$ is $n$-divisible, in the sense that it can be decomposed as the sum of $n$ independent and identically distributed $\mathrm{Bernoulli}(p)$ random variables; and $X \sim \mathrm{NegativeBinomial}(k, p)$ is $k$-divisible, in the sense that it can be decomposed as the sum of $k$ independent and identically distributed $\mathrm{Geometric}(p)$ random variables. On the other hand, $X \sim \mathrm{Poisson}(\lambda)$ is infinitely divisible, since for any $n \in \mathbb{N}$, it can be decomposed as the sum of $n$ independent and identically distributed $\mathrm{Poisson}\left(\frac{\lambda}{n}\right)$ random variables.
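The $n$-divisibility of the Poisson law can be checked by explicit convolution (a sketch with illustrative values; truncation at `kmax` does not affect the entries compared):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def convolve(p, q):
    """Convolution of two pmfs on 0..len-1, truncated to the same length."""
    kmax = len(p) - 1
    return [sum(p[j] * q[k - j] for j in range(k + 1)) for k in range(kmax + 1)]

lam, n, kmax = 3.0, 4, 12
component = [poisson_pmf(k, lam / n) for k in range(kmax + 1)]
total = component
for _ in range(n - 1):
    total = convolve(total, component)   # sum of n iid Poisson(lam/n)

err = max(abs(total[k] - poisson_pmf(k, lam)) for k in range(kmax + 1))
```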
Moreover, from Raikov’s theorem [
50], if $X \sim \mathrm{Poisson}(\lambda)$ is decomposed as the sum of two independent random variables, $X = Y + Z$, then
Y and
Z are also Poisson random variables. This means that the Poisson random variables are extreme points of the convex set of infinitely divisible random variables. A similar result with the Gaussian random variables, conjectured by Lévy [
51] and proved by Cramér [
52], shows that they are also extreme points of the set of infinitely divisible random variables. From this, Johansen (1966) [
53] supplied a structural proof of the integral representations of infinitely divisible characteristic functions. In fact, the Poisson random variables are the building blocks of infinitely divisible random variables (de Finetti (1929, 1932) [54,55]), which is a strong modelling asset, since many observed phenomena are the result of several contributing effects.