1. Introduction
The notion of the surprise index (SI) is not new in the literature, but it has not been discussed thoroughly, owing to its limited applicability and the complexity of deriving it for probability models that do not conform to well-known generating functions. The scarcity of scholarly works in this direction reflects this fact. The earliest reference dates back to 1948, when [1] asserted that an event with a low probability may be rare but is not surprising.
Interestingly enough, research on this topic is very limited. Some pertinent references are as follows: Ref. [2] generalized and derived the SI for the multivariate normal distribution, although with a different expression and notion. Ref. [3] derived the SIs for the binomial and Poisson distributions, but without adequate details. Ref. [4] discussed the SI for the negative binomial distribution. Ref. [5] discussed the role of the SI in the context of macro-surprises from a monetary economics perspective. From the above-cited references, one may conclude that finding the SI is difficult to achieve analytically and consequently requires the assistance of a powerful and efficient computing environment, such as Mathematica, which is utilized in this paper to obtain closed-form expressions for probability distributions in both the discrete and the continuous domains other than those that have already been discussed.
In this article, we aim to discuss, in adequate detail, the computation of SIs for various discrete probability distributions, including the binomial, negative binomial, and Poisson distributions (i.e., those that have at least been discussed in the literature), as well as SIs for the zero-truncated Poisson, geometric, Hermite, and Skellam distributions, which are new contributions to the current topic. In addition, we provide an analogous expression for deriving the SI for univariate continuous probability models, using the definition given in Equation (19) and defined later. For illustrative purposes, we compute SIs for various well-known univariate absolutely continuous probability models using Equation (19). It appears that, in most cases, the resulting expression of the SI associated with each of the discrete probability distributions is available in closed form, involving special functions and infinite series wherever applicable. Furthermore, we provide some empirical studies of the SIs corresponding to several discrete probability models. We conjecture that a similar development can be made in terms of identifying SIs for bivariate and/or multivariate continuous probability models, which will be the subject matter of a separate article. In summary, the major contributions of this article on the topic of SIs can be summarized as follows:
We revisit the computation of the SIs for the binomial, Poisson, and negative binomial distributions and provide the correct expression of the SI for the Poisson distribution using Mathematica.
Surprise indices are computed for the geometric, zero-truncated Poisson, Hermite, and Skellam distributions (for which closed-form expressions involving special functions and/or infinite series are available), while for the generalized Poisson distribution, the associated SI is not available in closed form and a numerical solution must be sought. All of these derivations are new contributions to this topic.
In addition, we provide the derivation of SIs for univariate continuous probability models using an analogous expression based on the geometric mean of a random variable.
Finally, we conduct empirical studies on SIs for several of the discrete distributions with varying parameter choices, and several useful observations are derived accordingly.
The remainder of this article is organized in the following manner: In Section 2, we provide the computational details of deriving the SI for each of the univariate discrete probability models assumed in this paper, with empirical studies on several of such probability models. In Section 3, we derive the SI for a continuous probability model based on the definition according to [2] and provide some useful conjectures on the properties of SIs. Section 4 presents several potential applications of the SI in a practical setting, along with some potential challenges to extending this definition to bivariate and higher domains. Finally, some concluding remarks are presented in Section 5.
2. Surprise Index Derivation: Preliminaries
We begin this section by providing the definition of the SI. According to [1], the SI is defined as the ratio of the expected probability to the observed probability, which has the following form:

$$\mathrm{SI} = \frac{E(p)}{p_m} = \frac{\sum_i p_i^2}{p_m}, \qquad (1)$$

where $\sum_i p_i^2$ is the expected value of the probabilities $p_i$ of all possible outcomes and $p_m$ represents the probability that an event $E_m$ has actually occurred. The expression in Equation (1) of the SI is from [3]. This feature can be obtained for discrete probability distributions by computing their corresponding probability generating functions, a strategy which is discussed later. Based on a suggestion by an anonymous reviewer, Equation (1) can alternatively be rewritten as $\mathrm{SI} = E[p(X)]/p(m)$, where $p(\cdot)$ denotes the p.m.f. and the expectation is taken with respect to $X$. Noticeably, this form was also independently obtained in [1].
Next, we revisit the computation of the SIs for the binomial, negative binomial, and Poisson distributions that have been independently discussed and derived in [3,4]. Proceeding in the same manner, we derive SIs for the zero-truncated Poisson, geometric, Hermite, and Skellam distributions. The process of obtaining the SI involves the following steps (for details, see [3]):
Step 1: Calculate the generating function of $X$, which is of the form $g(t) = \sum_x p_x t^x$, from a given probability mass function (p.m.f.).
Step 2: Set $t = e^{i\theta}$ to obtain the quantity $g(e^{i\theta})\,g(e^{-i\theta}) = |g(e^{i\theta})|^2$, whose average over a full period equals the numerator of Equation (1), where $i = \sqrt{-1}$.
Step 3: Integrate the simplified quantity on the R.H.S. obtained in Step 2 from 0 to $2\pi$, and divide by $2\pi$.
Then, substitute the value obtained in Step 3 into the numerator of Equation (1). Observe that, since the rationale behind this strategy for obtaining the SI has already been discussed in [3], it is not repeated here.
Next, this simple process is carried out below for each of the discrete probability distributions selected for this purpose. It is important to note that the goal of the above steps is to obtain an expression for the sum $\sum_x p_x^2$, which involves solving the integral in Step 3. In the next subsection, we begin by revisiting the SI for the binomial distribution.
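The three-step procedure can be sanity-checked numerically. The following Python sketch (our own illustration, not the paper's Mathematica code; the function names are ours) verifies, for a binomial example, that averaging $|g(e^{i\theta})|^2$ over one period of the unit circle recovers the sum of squared probabilities, i.e., the numerator of Equation (1):

```python
import cmath
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass function."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_pgf(t: complex, n: int, p: float) -> complex:
    """Probability generating function of Bin(n, p): g(t) = (q + p t)^n."""
    return (1 - p + p * t) ** n

n, p = 10, 0.3

# Direct evaluation of the numerator of Equation (1): sum of squared probabilities.
direct = sum(binom_pmf(k, n, p) ** 2 for k in range(n + 1))

# Steps 2-3: set t = e^{i*theta}, then average |g(e^{i*theta})|^2 over [0, 2*pi).
N = 256
integral = sum(
    abs(binom_pgf(cmath.exp(2j * math.pi * j / N), n, p)) ** 2 for j in range(N)
) / N
```

Because the pgf here is a polynomial of degree $n$, a uniform grid with more than $n$ points integrates the product exactly, so the two values agree to machine precision.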
2.1. Surprise Index for a Binomial Distribution
The binomial distribution is denoted as $\mathrm{Bin}(n, p)$, with $n$ being the number of trials and $p$ being the probability of success in each trial. The associated probability mass function (p.m.f.) is

$$P(X = x) = \binom{n}{x} p^x q^{n - x}, \qquad x = 0, 1, \ldots, n,$$

where $x$ is the number of successes and $q = 1 - p$. The associated generating function will be $g(t) = (q + pt)^n$. Then, following Steps 2 and 3 (given earlier) and simplifying, we obtain

$$\sum_{x=0}^{n} p_x^2 = q^{2n}\, {}_2F_1\!\left(-n, -n; 1; \frac{p^2}{q^2}\right), \qquad (2)$$

where ${}_2F_1$ is the Gauss hypergeometric function; since $n$ is a positive integer, the series terminates. Therefore, the SI for the binomial distribution related to the $m$-th probability is (on substituting Equation (2) in the numerator of Equation (1)):

$$\mathrm{SI} = \frac{q^{2n}\, {}_2F_1\!\left(-n, -n; 1; p^2/q^2\right)}{\binom{n}{m} p^m q^{n - m}}. \qquad (3)$$
For illustrative purposes, we assume some representative values of p and subsequently compute the associated values of the SI for a fixed value of n and for varying choices of m, p, and q in Equation (3); the results are reported in Table 1.
From Table 1, we can observe the following:
For fixed m, with p decreasing, the corresponding SI values increase, which is expected.
For fixed values of p and q, as the number of successes m increases (so that the observed probability decreases), the SI values increase.
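The binomial numerator in Equation (2) can be verified term by term: since $[(-n)_k]^2 / ((1)_k\, k!) = \binom{n}{k}^2$, the terminating series for ${}_2F_1(-n,-n;1;p^2/q^2)$, multiplied by $q^{2n}$, reduces to $\sum_k \binom{n}{k}^2 p^{2k} q^{2(n-k)}$. A Python check (our own sketch; function names and parameter values are ours):

```python
import math

def hyp2f1_terminating(n: int, z: float) -> float:
    """Gauss 2F1(-n, -n; 1; z) evaluated via its terminating series."""
    total, term = 0.0, 1.0
    for k in range(n + 1):
        total += term
        # Ratio of consecutive series terms: ((k - n)^2 / (k + 1)^2) * z
        term *= (k - n) ** 2 * z / (k + 1) ** 2
    return total

n, p, m = 12, 0.35, 4
q = 1 - p

# Closed form of Equation (2) versus the direct sum of squared probabilities.
closed = q ** (2 * n) * hyp2f1_terminating(n, (p / q) ** 2)
direct = sum((math.comb(n, k) * p**k * q ** (n - k)) ** 2 for k in range(n + 1))

# SI at the observed value m, as in Equation (3).
si = closed / (math.comb(n, m) * p**m * q ** (n - m))
```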
2.2. Surprise Index for a Negative Binomial Distribution
The negative binomial distribution is denoted as $\mathrm{NB}(r, p)$, with $r$ being the number of successes until the experiment is terminated and $p$ being the probability of success in each trial. The associated p.m.f. is

$$P(X = x) = \binom{x + r - 1}{x} p^r q^x, \qquad x = 0, 1, 2, \ldots, \qquad (4)$$

where $x$ is the number of failures and $q = 1 - p$. Consequently, the generating function will be $g(t) = \left(\dfrac{p}{1 - qt}\right)^r$. Proceeding as before, we obtain, using Mathematica,

$$\sum_{x=0}^{\infty} p_x^2 = p^{2r}\, {}_2F_1\!\left(r, r; 1; q^2\right), \qquad (5)$$

where ${}_2F_1$ is the Gauss hypergeometric function defined in Equation (3). Thus, the SI for the negative binomial distribution is, on substituting Equation (5) in the numerator of Equation (1),

$$\mathrm{SI} = \frac{p^{2r}\, {}_2F_1\!\left(r, r; 1; q^2\right)}{\binom{m + r - 1}{m} p^r q^m}. \qquad (6)$$
Assuming several representative values of p and r, and substituting various values for m in Equation (6), we find the values of the SI for this distribution, which are presented in Table 2.
From Table 2, one may observe the following:
The SI values depend on the magnitude of either or both of p and r.
For fixed p, as r increases, the SI values decrease for varying m.
For fixed r, with q increasing, the SI values increase.
For fixed r, with q decreasing, the SI values increase as m decreases.
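The identity behind Equation (5) can be checked directly: since $(r)_x/x! = \binom{x+r-1}{x}$, the series for ${}_2F_1(r,r;1;q^2)$ scaled by $p^{2r}$ reproduces the sum of squared negative binomial probabilities. A Python sketch (our own; the truncation lengths are arbitrary but conservative):

```python
import math

def nb_pmf(x: int, r: int, p: float) -> float:
    """Negative binomial p.m.f.: x failures before the r-th success."""
    return math.comb(x + r - 1, x) * p**r * (1 - p) ** x

def hyp2f1_rr1(r: int, z: float, terms: int = 500) -> float:
    """Gauss 2F1(r, r; 1; z) via its power series (requires |z| < 1)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        # Ratio of consecutive series terms: ((r + k)^2 / (k + 1)^2) * z
        term *= (r + k) ** 2 * z / (k + 1) ** 2
    return total

r, p, m = 5, 0.4, 3
q = 1 - p

closed = p ** (2 * r) * hyp2f1_rr1(r, q * q)          # Equation (5)
direct = sum(nb_pmf(x, r, p) ** 2 for x in range(400))  # truncated direct sum
si = closed / nb_pmf(m, r, p)                          # Equation (6)
```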
2.3. Surprise Index for a Poisson Distribution
The associated p.m.f. is

$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \qquad x = 0, 1, 2, \ldots,$$

where $x$ is the number of occurrences and $\lambda > 0$. The associated generating function is $g(t) = e^{\lambda(t - 1)}$. Proceeding as before,

$$\sum_{x=0}^{\infty} p_x^2 = e^{-2\lambda}\, I_0(2\lambda), \qquad (7)$$

where $I_0(\cdot)$ is the zero-order modified Bessel function of the first kind. Therefore, the SI for the Poisson distribution, on substituting Equation (7) in the numerator of Equation (1), is

$$\mathrm{SI} = \frac{e^{-2\lambda}\, I_0(2\lambda)}{e^{-\lambda} \lambda^m / m!} = \frac{e^{-\lambda}\, m!\, I_0(2\lambda)}{\lambda^m}. \qquad (8)$$
Substituting various values for $\lambda$ and m in Equation (8), we find the values of the SI for this distribution, given in Table 3. From Table 3, one can observe the following:
For a fixed $\lambda$, with m increasing (and the observed probability decreasing), the SI values increase.
For a fixed m, with $\lambda$ increasing, the SI values decrease.
For a comprehensive view of the SI in this case, further empirical studies are required.
2.4. Surprise Index for a Zero-Truncated Poisson Distribution
The zero-truncated Poisson distribution is denoted as $\mathrm{ZTP}(\lambda)$ with parameter $\lambda > 0$. The p.m.f. is

$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!\,(1 - e^{-\lambda})}, \qquad x = 1, 2, \ldots,$$

where $x$ is the number of occurrences; for a detailed study on this distribution, see [6]. The associated generating function will be $g(t) = \dfrac{e^{\lambda t} - 1}{e^{\lambda} - 1}$. Proceeding as before, the numerator of Equation (1) in this case will be

$$\sum_{x=1}^{\infty} p_x^2 = \frac{e^{-2\lambda}\left(I_0(2\lambda) - 1\right)}{(1 - e^{-\lambda})^2}, \qquad (9)$$

where $I_0(\cdot)$ has been defined in the previous subsection. Therefore, upon substituting Equation (9) in the numerator of Equation (1), the SI for the zero-truncated Poisson distribution will be

$$\mathrm{SI} = \frac{\left(I_0(2\lambda) - 1\right) m!}{(e^{\lambda} - 1)\,\lambda^m}. \qquad (10)$$
Substituting various representative values for $\lambda$ and m in Equation (10), we find the values of the SI for this distribution, which are presented in Table 4.
From Table 4, one can observe the following:
The SI values are slightly different from the Poisson distribution’s SI values. Also, we see that smaller values of $\lambda$ generate greater differences between the zero-truncated Poisson and the Poisson SI values.
The behavior/changing pattern of the SI values is exactly the same (except for the magnitude) as in the previous case (the Poisson distribution), for varying choices of m and $\lambda$.
2.5. Surprise Index for a Geometric Distribution
The geometric distribution is denoted as $\mathrm{Geo}(p)$, with $X$ being the number of Bernoulli trials needed to achieve one success. The associated p.m.f. is

$$P(X = x) = p\, q^{x - 1}, \qquad x = 1, 2, \ldots,$$

where $x$ is the number of trials and $q = 1 - p$. The generating function is then found to be $g(t) = \dfrac{pt}{1 - qt}$. Consequently, the numerator of Equation (1) will be

$$\sum_{x=1}^{\infty} p_x^2 = \frac{p^2}{1 - q^2} = \frac{p}{1 + q}, \qquad (11)$$

on using Mathematica. Hence, on substituting Equation (11) in the numerator of Equation (1), we have the following expression for the SI for the geometric distribution:

$$\mathrm{SI} = \frac{p/(1 + q)}{p\, q^{m - 1}} = \frac{q^{1 - m}}{1 + q}. \qquad (12)$$
Assuming various representative values for m and $q$ in Equation (12), we find the values of the SI for this distribution, which are given in Table 5.
From Table 5, one may observe the following:
For fixed q, with m increasing, the SI values exhibit an increasing pattern.
For fixed m, with q decreasing, the SI values increase.
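With p.m.f. $p\,q^{x-1}$ ($x = 1, 2, \ldots$), the numerator is the geometric series $\sum_x p^2 q^{2(x-1)} = p^2/(1-q^2) = p/(1+q)$, so the SI at an observed value m reduces to $q^{1-m}/(1+q)$. A quick numerical confirmation (our own sketch; parameter values are arbitrary):

```python
p, m = 0.25, 6
q = 1 - p

def geom_pmf(k: int) -> float:
    """Geometric p.m.f. on the number of trials until the first success."""
    return p * q ** (k - 1)

# Truncated direct sum of squared probabilities, approx. p / (1 + q).
numerator = sum(geom_pmf(k) ** 2 for k in range(1, 2000))
si_direct = numerator / geom_pmf(m)
si_closed = q ** (1 - m) / (1 + q)   # closed form, Equation (12)
```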
2.6. Surprise Index for a Hermite Distribution
The Hermite distribution is denoted as $\mathrm{Herm}(a_1, a_2)$ with parameters $a_1 > 0$ and $a_2 > 0$. This distribution is used to model count data using more than one parameter and has been used in biological research. Several scholarly studies related to this distribution exist in the literature. For example, Ref. [7] discussed several useful structural properties of the Hermite distribution and established that it is a generalized Poisson distribution. Ref. [8] discussed the utility of this distribution in the context of a zero-inflated overdispersed probability model. Ref. [9] developed the R package hermite to apply the generalized Hermite distribution to real-world scenarios of fitting count data in the presence of overdispersion or multimodality, with considerably more flexibility in terms of inference under the classical method. The associated p.m.f. of the random variable $X = Y_1 + 2Y_2$ is

$$P(X = x) = e^{-(a_1 + a_2)} \sum_{j=0}^{[x/2]} \frac{a_1^{x - 2j}\, a_2^{j}}{(x - 2j)!\, j!}, \qquad x = 0, 1, 2, \ldots,$$

where $[x/2]$ is the integer part of $x/2$, and $a_1$ and $a_2$ are the parameters associated with the two independent Poisson variables $Y_1$ and $Y_2$, respectively. The associated generating function is given by $g(t) = \exp\!\left(a_1(t - 1) + a_2(t^2 - 1)\right)$.
Proceeding as before, the numerator of Equation (1) can be expressed in closed form in terms of the regularized hypergeometric function; this expression, obtained using Mathematica, is given in Equation (13). Therefore, upon substituting Equation (13) in the numerator of Equation (1), the SI for the Hermite distribution is given in Equation (14). Substituting various values for m in Equation (14), one can find values of the SI for this distribution, which are not reported in this paper for brevity. Moreover, the SI is quite difficult to obtain numerically in this case, as the expression involves infinite sums and special functions.
2.7. Surprise Index for a Skellam Distribution
The Skellam distribution, also known as the Poisson difference distribution, arises as the difference of two independent Poisson random variables (for details, see [10]) and is denoted as $\mathrm{Skellam}(\mu_1, \mu_2)$ with parameters $\mu_1 > 0$ and $\mu_2 > 0$. This distribution may be used for describing the point spread distribution in sports such as hockey, where all points scored are equal; for describing the statistics of the difference of two images with simple photon noise; or for studying treatment effects, as discussed in [10]. The p.m.f., when considering two independent Poisson random variables $N_1 \sim \mathrm{Poisson}(\mu_1)$ and $N_2 \sim \mathrm{Poisson}(\mu_2)$ with $X = N_1 - N_2$, is given by

$$P(X = m) = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{m/2} I_m\!\left(2\sqrt{\mu_1 \mu_2}\right),$$

where m is an integer and $I_m(\cdot)$ is the m-th order modified Bessel function of the first kind. The associated generating function will be $g(t) = \exp\!\left(-(\mu_1 + \mu_2) + \mu_1 t + \mu_2/t\right)$.
Again, proceeding as before, the numerator of Equation (1) can be derived using the infinite series expansion of the exponential function together with Mathematica; the resulting expression is given in Equation (15). Subsequently, upon substituting Equation (15) in the numerator of Equation (1), the SI for the Skellam distribution can be written as in Equation (16). Substituting various values for m in Equation (16), one can find expressions for the SI for this distribution. However, from Equation (16), it is clear that numerical values would be difficult to obtain, as the expression involves an infinite sum and gamma functions.
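Even so, the numerator can be approximated numerically without Bessel functions by evaluating the p.m.f. as a truncated convolution of the two Poisson components. A Python sketch (our own; the truncation limits are arbitrary but conservative):

```python
import math

def poisson_pmf(k: int, mu: float) -> float:
    return math.exp(-mu) * mu**k / math.factorial(k)

def skellam_pmf(m: int, mu1: float, mu2: float, trunc: int = 60) -> float:
    """P(N1 - N2 = m) via truncated convolution of independent Poisson variables."""
    return sum(
        poisson_pmf(m + j, mu1) * poisson_pmf(j, mu2)
        for j in range(max(0, -m), trunc)
    )

mu1, mu2 = 2.0, 1.0
support = range(-40, 41)                       # effective support for these parameters

numerator = sum(skellam_pmf(k, mu1, mu2) ** 2 for k in support)
si_at_0 = numerator / skellam_pmf(0, mu1, mu2)  # SI at the observed difference m = 0
total = sum(skellam_pmf(k, mu1, mu2) for k in support)  # sanity check: ~1
```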
2.8. Surprise Index for a Generalized Poisson Distribution
The generalized Poisson distribution (GPD) is denoted as $\mathrm{GP}(\theta, \lambda)$ with parameters $\theta > 0$ and $0 \le \lambda < 1$. To allow us to differentiate between the parameter and the integration variable, we relabel the second parameter, and the p.m.f. is

$$P(X = x) = \frac{\theta\, (\theta + x\lambda)^{x - 1}\, e^{-(\theta + x\lambda)}}{x!}, \qquad x = 0, 1, 2, \ldots,$$

where $x$ is the number of occurrences. The associated generating function can then be written, according to [11], in terms of $W(\cdot)$, the Lambert W function. Continuing with the prescribed process, we found the integral form given in Equation (17). Consequently, the associated SI for the GPD, upon substituting Equation (17) in the numerator of Equation (1), is given in Equation (18). Noticeably, from Equation (18), it can be observed that this integral is difficult to solve in a closed and analytically tractable form because of the involvement of the Lambert W function, which has both real and imaginary parts. Numerical methods must be adopted, which we have not considered for brevity.
In addition, for illustrative purposes, we have also provided graphs of the SI for several of the discrete probability distributions discussed in this section in Appendix B.
3. Surprise Index for Continuous Probability Models
For a continuous random variable (r.v.), the associated expression for the SI was given by [2]; in that formulation, the probability density function (p.d.f.) of the original r.v. is itself treated as an r.v., p is a realization of this r.v., and H is a simple statistical hypothesis. Equivalently, we may rewrite the definition as follows. Let X be a continuous random variable with density function $f(x)$. Then, for all $x$ with $f(x) > 0$, the SI involves the geometric expectation (it is termed a generalization of the SI) and is given by

$$\mathrm{SI}(x) = \frac{G[f(X)]}{f(x)}, \qquad (19)$$

where $G[f(X)]$ stands for the geometric expectation, which can be equivalently evaluated using

$$G[f(X)] = \exp\!\left\{E\left[\ln f(X)\right]\right\}.$$
For computation of the SI for various continuous probability models, we use Equation (19). In Table 6, we provide the expression obtained from Equation (19), which can be viewed as an expression of the SI (according to [2]), for various univariate absolutely continuous distributions. The symbolic computations are all carried out using Mathematica.
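As a concrete check of the geometric-expectation form, take the exponential density $f(x) = \lambda e^{-\lambda x}$: then $E[\ln f(X)] = \ln \lambda - 1$, so $G[f(X)] = \lambda/e$ and $\mathrm{SI}(x) = e^{\lambda x - 1}$. A Monte Carlo sketch in Python (our own, not the paper's Mathematica code):

```python
import math
import random

random.seed(7)
lam, x0 = 1.5, 2.0

# Monte Carlo estimate of E[ln f(X)] for the exponential density f(x) = lam * e^(-lam x).
n = 200_000
mean_logf = sum(
    math.log(lam) - lam * random.expovariate(lam) for _ in range(n)
) / n

g = math.exp(mean_logf)                       # geometric expectation, approx. lam / e
si_mc = g / (lam * math.exp(-lam * x0))       # Equation (19) at the observed point x0
si_exact = math.exp(lam * x0 - 1)             # closed form for this density
```

The Monte Carlo estimate carries sampling error of order $1/\sqrt{n}$, so it matches the closed form only to a few decimal places.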
From Table 6, one can make the following observations for a fixed observed value x:
For the uniform distribution, with b increasing and a decreasing, the SI will increase.
For the Beta distribution, as a increases and b decreases, the SI decreases; on the other hand, when both a and b increase, the SI increases.
For the Beta (type-II) distribution, when both parameters increase, the SI will increase.
For the Pareto (type-II) distribution, because of the nature of the polygamma function as obtained from Mathematica, the expression is divergent for any choice of the shape parameter, regardless of the permissible choices of the other two parameters, and therefore the SI cannot be computed.
For the Log-normal distribution, as both $\mu$ and $\sigma$ increase, the associated SI increases.
For the Gamma distribution: (i) when the shape parameter is fixed, with the scale parameter increasing, the SI will increase; and (ii) with the scale parameter fixed and the shape parameter increasing, the SI will increase.
For the Weibull distribution, the following can be observed:
- – For a fixed shape parameter k, as the scale parameter increases, the SI will increase.
- – For a fixed scale parameter, as k increases, the SI will increase.
- – For the scale parameter decreasing with k increasing, for a fixed choice of the observed value, the corresponding SI will decrease.
Next, we make the following conjectures; the proofs seem straightforward, but we leave them to the reader.
Note that in Appendix A, we provide the Mathematica codes for computing the SI for both univariate discrete and continuous probability models.
4. Potential Applications and Challenges/Open Problems
The use of Weaver’s SI as an alternative to the use of tail-area probabilities was suggested by [2]. Some applications of the SI have been presented, such as determining whether certain events are surprising, e.g., being dealt the same hand of cards consecutively in a game of bridge [1], or a fair coin with edges of a particular size landing on its edge when flipped [1]. Although these applications are interesting, they are not particularly useful in practice. For example, Ref. [4] suggests using the SI for outlier detection, which we find intriguing, since detecting outliers can be difficult; by applying this feature to various data sets, we established that it can be considered another tool for detecting outliers.
The Hermite distribution has been used to model the distribution of counts of bacteria in leucocytes. We expect that applying the surprise index for this distribution could be useful in determining whether the counts of bacteria in white blood cells (leucocytes) are alarmingly high. This information could be helpful in choosing follow-up tests, diagnosing diseases, or expediting care for patients who need urgent medical attention.
Several potential challenges in extending this definition to bivariate and higher domains may be summarized as follows:
- (i) Ref. [2] states, “for multivariate normal distributions, the distribution of the likelihood density does not seem to be expressible in elementary terms” (p. 1133);
- (ii)
The special functions are difficult to determine for the univariate case, which leads to even more difficulty when more variables are considered;
- (iii)
The long runtimes when finding the closed-form expressions for several of such distributions suggest that a multivariate analysis of the SI will require highly efficient computing environments.
5. Concluding Remarks
In this article, we discuss, in adequate detail, the derivation of the SI for several univariate discrete probability distributions that had not been discussed earlier, along with a re-evaluation of the surprise indices for the binomial, negative binomial, and Poisson distributions. Using the Mathematica software, we obtain closed-form expressions for the SI for the binomial, negative binomial, and Poisson distributions, as well as for the zero-truncated Poisson, geometric, Hermite, and Skellam distributions, involving special functions and/or infinite sums or series. Also, we have computed the SI for univariate continuous probability models via an analogous expression (similar to the discrete case, but not exactly the same), which involves computing the geometric mean of a random variable. Extension to bivariate and higher dimensions will be the topic of a separate article. However, the SI is not above criticism. For example, it is conjectured that, in the definition of the SI, the numerator given in Equations (1) and (2) is arbitrary. Furthermore, the value of the SI changes drastically when the results of an experiment are lumped together in a different way (discrete case) and/or there is a change in the values of stochastically independent r.v.s (continuous case).