1. Introduction
Count data represent the number of times a particular event occurs in an interval of time, space, or other unit of measurement. Such data are common in many areas, including medicine, economics, and engineering. For example, Böhning et al. [1] analyzed count data with additional zeros from a dental epidemiological study. Salman et al. [2] analyzed bankruptcy count data from Swedish small manufacturing firms to investigate the risk factors for business failure. Calabria et al. [3] analyzed the reliability of repairable systems from in-service failure count data.
There are many real-world scenarios in which the probability of success in binomial experiments cannot be considered constant. For example, the probability of consuming alcohol across the 7 days of a particular week varies from one individual to another (see Alanko and Lemmens [4]). Placing a beta distribution on the probability of success of a binomial distribution, which gives rise to the beta-binomial distribution, is not overly restrictive, since the probability density function of the beta distribution can take a wide variety of shapes.
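A random variable $X$ follows the beta-binomial distribution, denoted $X \sim \mathrm{BB}(n, \alpha, \beta)$, if its probability mass function (p.m.f.) is given by
$$P(X = x) = \binom{n}{x} \frac{B(x + \alpha,\, n - x + \beta)}{B(\alpha, \beta)}, \quad x = 0, 1, \ldots, n, \tag{1}$$
where $\alpha, \beta > 0$, and $B(a, b) = \int_0^1 t^{a-1} (1 - t)^{b-1} \, dt$ is the beta function.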
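The p.m.f. above can be evaluated numerically with a few lines of code. The following is a minimal Python sketch, assuming the standard form P(X = x) = C(n, x) B(x + α, n − x + β) / B(α, β); the function names and the parameter values are illustrative and are not taken from the paper, whose own code is written in R.

```python
from math import comb, exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bb_pmf(x: int, n: int, alpha: float, beta: float) -> float:
    """Beta-binomial p.m.f.: C(n, x) * B(x + alpha, n - x + beta) / B(alpha, beta)."""
    if not 0 <= x <= n:
        return 0.0
    log_ratio = log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)
    return comb(n, x) * exp(log_ratio)


# Illustrative parameter values (assumptions, not estimates from the paper).
n, alpha, beta = 7, 0.8, 1.5
probs = [bb_pmf(x, n, alpha, beta) for x in range(n + 1)]
total = sum(probs)  # equals 1 up to floating-point error
```

Working on the log scale avoids overflow in the gamma function when n or the shape parameters are large.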
In Bayesian inference, the beta-binomial distribution is used to make predictions about the number of successes in future trials, taking into account the uncertainty in the estimate of the probability of success. In classical inference, the beta-binomial distribution can be used to model data with overdispersion in binomial experiments, i.e., when the observed variability is greater than that expected under a standard binomial distribution.
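Overdispersion can be made concrete by comparing the variance of a beta-binomial distribution with that of a binomial distribution having the same mean. The Python sketch below does this for one set of illustrative parameter values; the values and the helper names are assumptions chosen for the example, not quantities from the paper.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bb_pmf(x, n, alpha, beta):
    """Beta-binomial p.m.f. evaluated on the log scale."""
    return comb(n, x) * exp(log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))


def mean_var(probs):
    """Mean and variance of a p.m.f. given as a list indexed by x = 0, 1, ..."""
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var


# Illustrative values (assumptions for this example only).
n, alpha, beta = 7, 2.0, 3.0
p = alpha / (alpha + beta)  # mean success probability
bb_mean, bb_var = mean_var([bb_pmf(x, n, alpha, beta) for x in range(n + 1)])
binom_var = n * p * (1 - p)  # variance of Binomial(n, p) with the same mean
dispersion_ratio = bb_var / binom_var  # values above 1 indicate overdispersion
```

For these values the beta-binomial variance is twice the binomial variance, which agrees with the known factor 1 + (n − 1)/(α + β + 1).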
A review of the applicability and extensions of the beta-binomial distribution can be found in Wilcox [5]. The use of the beta-binomial distribution in the context of regression is discussed in Crowder [6]. Details on the estimation of the parameters of the beta-binomial distribution can be found in Tripathi et al. [7].
Several more recent applications of the beta-binomial distribution can be found in the literature. To name a few, Palm et al. [8] use the beta-binomial distribution in the formulation of the BBARMA (Beta-Binomial Autoregressive Moving Average) model, which can capture the temporal dynamics and autoregressive structure in count data. Chen et al. [9] use the beta-binomial distribution to propose a GARCH model that captures the variation in the number of new cases of cryptosporidiosis infection, obtaining a useful model for time series data that present bounded counts and high volatility. Jansen and Holling [10], under a Bayesian approach, use the beta-binomial distribution in the meta-analysis of rare events.
Although the beta-binomial distribution is applied in various real-world settings, it performs poorly when the empirical distribution is bimodal, i.e., when it has two modes or peaks. Bimodality can be explained by the existence of two groups or subpopulations with distinct characteristics or by the existence of latent variables that significantly influence the distribution of the population.
A popular approach in the literature for incorporating flexibility in terms of asymmetry and multimodality relies on the weighted distributions proposed by Fisher [11] and Rao [12]. Suppose that $X$ is a random variable with probability function $f(x)$. The weighted random variable $X_w$ has probability function
$$f_w(x) = \frac{w(x) f(x)}{E[w(X)]}, \tag{2}$$
where $w(x)$ is a nonnegative weight function and $0 < E[w(X)] < \infty$.
A particularly salient case of (2) is obtained when $w(x) = x$, which defines a length-biased distribution. These distributions arise naturally in applied fields, such as reliability and survival analysis, when individuals or mechanical units are sampled with unequal probability due to the experimental design or to unequal probabilities of detection.
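For a discrete baseline with finite support, the weighted construction in (2) amounts to multiplying each probability by its weight and renormalizing. The Python sketch below illustrates this, with the length-biased weight w(x) = x applied to a small binomial baseline; the function name and the toy baseline are assumptions made for the example.

```python
from typing import Callable, Dict


def weighted_pmf(pmf: Dict[int, float], w: Callable[[int], float]) -> Dict[int, float]:
    """Return the weighted p.m.f. w(x) f(x) / E[w(X)] on a finite support."""
    norm = sum(w(x) * p for x, p in pmf.items())  # E[w(X)]
    if not 0 < norm < float("inf"):
        raise ValueError("E[w(X)] must be positive and finite")
    return {x: w(x) * p / norm for x, p in pmf.items()}


# Toy baseline: the Binomial(2, 0.5) distribution.
baseline = {0: 0.25, 1: 0.5, 2: 0.25}

# Length-biased version, obtained with the weight w(x) = x.
length_biased = weighted_pmf(baseline, lambda x: x)
```

Here E[X] = 1, so the length-biased probabilities are 0, 0.5, and 0.5: the value 0 can no longer be observed and larger values gain mass.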
On the other hand, the literature offers weight functions that can make the weighted distributions resulting from (2) multimodal. For example, if $w(x) = 1 + (1 - \alpha x)^2$, $\alpha \in \mathbb{R}$, and $f$ is the pdf of the normal distribution with mean $0$ and variance $1$, then (2) reduces to the family of bimodal distributions called the alpha-skew-normal distribution; see Elal-Olivero [13]. Based on the same weight function, Gómez-Déniz et al. [14] introduce a bimodal version of the Poisson distribution. Cortés et al. [15] propose a parametric weight function that involves a power function of exponent 4, which can lead to a probability function with up to three modes.
In this paper, we propose an extension of the beta-binomial distribution suited to fitting overdispersed binomial data that may be either unimodal or bimodal. The proposal arises from (2), using the weight function proposed by Elal-Olivero [13] with a beta-binomial baseline distribution. In this way, the new distribution extends the use of beta-binomial distributions to real-world scenarios where empirical distributions exhibit bimodality.
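The construction described in this paragraph can be sketched numerically. The Python code below weights a beta-binomial baseline by a quadratic weight of the form w(x) = 1 + (1 − λx)², following the Elal-Olivero weight, and renormalizes by direct summation. This is only an illustration under stated assumptions: the name `lam` for the weight parameter, the function names, and the parameter values are hypothetical, and the exact parameterization and closed-form normalizing constant used in Section 2 may differ.

```python
from math import comb, exp, lgamma


def log_beta(u, v):
    """Logarithm of the beta function B(u, v)."""
    return lgamma(u) + lgamma(v) - lgamma(u + v)


def bb_pmf(x, n, a, b):
    """Beta-binomial baseline p.m.f. evaluated on the log scale."""
    return comb(n, x) * exp(log_beta(x + a, n - x + b) - log_beta(a, b))


def weighted_bb_pmf(n, a, b, lam):
    """Weighted beta-binomial p.m.f. with weight w(x) = 1 + (1 - lam * x)**2.

    `lam` is a hypothetical name for the weight parameter; the normalizing
    constant E[w(X)] is obtained by summing over x = 0, ..., n.
    """
    base = [bb_pmf(x, n, a, b) for x in range(n + 1)]
    weights = [1.0 + (1.0 - lam * x) ** 2 for x in range(n + 1)]
    norm = sum(w * p for w, p in zip(weights, base))
    return [w * p / norm for w, p in zip(weights, base)]


def count_modes(probs):
    """Count strict local maxima of a p.m.f.; plateaus are not counted."""
    padded = [float("-inf"), *probs, float("-inf")]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Illustrative values only (assumptions, not estimates from the paper).
probs = weighted_bb_pmf(n=7, a=2.0, b=3.0, lam=0.5)
n_modes = count_modes(probs)
```

With `lam = 0` the weight is constant and the baseline beta-binomial distribution is recovered, which gives a simple sanity check for any implementation of this kind.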
The remainder of the paper is organized as follows. In Section 2, we define the bimodal beta-binomial random variable and study some of its properties, such as the probability mass function, the cumulative distribution function, and the raw moments. The latter are used to describe the behavior of the relative dispersion with respect to the mean and the skewness of the distribution. In Section 3, parameter estimation for the new distribution using the maximum likelihood method is discussed, and a simulation study is carried out to evaluate the behavior of the estimators. In Section 4, two application examples with real data are presented to illustrate the usefulness of the proposed distribution. Finally, concluding remarks are presented in Section 5.
4. Applications
In this section, two applications are presented to illustrate the utility of the bimodal beta-binomial (BBB) distribution in modeling count data. In each application, the beta-binomial (BB) and McDonald generalized beta-binomial (McGBB) [19] distributions are incorporated into the analysis. The p.m.f. of the McGBB distribution is given by
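$$P(X = x) = \binom{n}{x} \frac{1}{B(\alpha, \beta)} \sum_{j=0}^{n-x} (-1)^j \binom{n-x}{j} B\left(\frac{x + j}{\gamma} + \alpha,\, \beta\right), \quad x = 0, 1, \ldots, n,$$
where $\alpha, \beta, \gamma > 0$ and $B(\cdot, \cdot)$ is the beta function.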
Like the BB distribution, the McGBB distribution performs well in modeling overdispersed binomial data and, owing to its larger parameter dimension, may outperform the BB distribution on such data. The McGBB distribution can also model bimodality, but only when the empirical frequency distribution exhibits a bathtub shape, which makes the BBB distribution a natural alternative to the McGBB distribution.
We assessed the quality of the fits using the chi-square goodness-of-fit test and compared the fitted models using the Akaike information criterion (AIC) [20] and the Bayesian information criterion (BIC) [21]. The R code used in this section is provided in Appendix A.
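To make the comparison criteria concrete, the Python sketch below computes the log-likelihood of grouped counts, the AIC and BIC, and the Pearson chi-square statistic for a fitted p.m.f. It is an illustration only and not the R code of Appendix A: the function names are hypothetical, the frequencies are toy values rather than data from the paper, and the sketch neither pools cells with small expected counts nor computes the p-value of the test.

```python
from math import log


def log_likelihood(freqs, probs):
    """Log-likelihood of grouped counts: sum over x of freq_x * log(p_x)."""
    return sum(f * log(p) for f, p in zip(freqs, probs) if f > 0)


def aic(loglik, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n_obs):
    """Bayesian information criterion for k parameters and n_obs observations."""
    return k * log(n_obs) - 2 * loglik


def chi_square_stat(freqs, probs):
    """Pearson statistic: sum of (observed - expected)^2 / expected.

    Assumes every fitted probability is strictly positive.
    """
    n_obs = sum(freqs)
    return sum((f - n_obs * p) ** 2 / (n_obs * p) for f, p in zip(freqs, probs))


# Toy example (hypothetical frequencies and fitted probabilities).
freqs = [10, 20, 10]
probs = [0.25, 0.5, 0.25]
k = 1  # number of estimated parameters in the hypothetical model

ll = log_likelihood(freqs, probs)
model_aic = aic(ll, k)
model_bic = bic(ll, k, sum(freqs))
chi2 = chi_square_stat(freqs, probs)
```

Lower AIC and BIC values indicate a better trade-off between fit and complexity. The chi-square statistic is usually referred to a chi-square distribution with the number of cells minus one minus the number of estimated parameters as degrees of freedom.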
Furthermore, we use the excess mass test proposed in Ameijeiras-Alonso et al. [22] to assess the bimodality of the data considered in the first application and the unimodality of the data considered in the second application. For this, we used the modetest function [23] in the R programming language.
4.1. Alcohol Consumption Data
The first data set records, for each of 399 individuals, the number of days on which alcohol was consumed out of the $n = 7$ days of each of two reference weeks (week 1 and week 2); see Table 2. Although one might attempt to fit these data with the binomial distribution, the probability of consuming alcohol on a randomly chosen day of the week varies from one individual to another. For this reason, Alanko and Lemmens [4] use the beta-binomial distribution to fit these data. Manoj et al. [19], in turn, show that these data are overdispersed with respect to the binomial distribution and that the McGBB distribution fits them better than the BB and Kumaraswamy binomial [24] distributions.
For these data, we test the null hypothesis $H_0$: the data have exactly two modes versus the alternative hypothesis $H_1$: the data have more than two modes. For the data from week 1, we obtain an observed statistic equal to 0.021 with a p-value equal to 0.644. For the data from week 2, we obtain an observed statistic equal to 0.023 with a p-value equal to 0.406. Therefore, at a significance level of 0.05, $H_0$ is not rejected for either week; that is, the frequency distributions of the data for weeks 1 and 2 are consistent with bimodal behavior.
Other previous studies with these data can be found in Rodríguez-Avi et al. [25].
Table 2 reports the results obtained when fitting the alcohol consumption data using the BB, McGBB, and BBB distributions. The table shows that the BBB distribution presents the highest p-values in the chi-square goodness-of-fit test and the lowest AIC and BIC values, suggesting that the BBB distribution should be selected among the fitted distributions for modeling the alcohol consumption data.
Figure 4 shows the frequency distribution of the alcohol consumption data (Weeks 1 and 2) and the fitted BB, McGBB, and BBB distributions. In the figure, it can be seen that the frequency distributions of the number of drinking days present two frequency peaks and that the mass values of the BBB distribution are the closest to the empirical frequency values.
4.2. Candidate Assessment Data
In this section, we consider a data set on candidate performance on an exam consisting of 9 questions. Each question is scored out of a total of 20 points and, when assessing a candidate’s final score, special attention is paid to the total number of questions on which he or she obtains an “alpha”, defined as a score of at least 15 points on the question. The number of alphas is a rough indication of the quality of the candidate’s exam performance, so the distribution of alphas across candidates is of interest. A total of 209 candidates attempted questions from this 9-question section, and 326 alphas were awarded in total. Thus, we consider $n = 9$ (number of trials/questions), where the dichotomous variable on each trial is whether or not the candidate scored an alpha.
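As a quick check on the scale of these data, the totals quoted above give the average number of alphas per candidate and the pooled proportion of alphas per question. The symbols $\bar{x}$ and $\hat{p}$ are introduced here for illustration only and are not notation from the paper.

```latex
\bar{x} = \frac{326}{209} \approx 1.56,
\qquad
\hat{p} = \frac{326}{209 \times 9} = \frac{326}{1881} \approx 0.173 .
```

Under a plain binomial model with this $\hat{p}$, the variance of the number of alphas would be about $9\,\hat{p}\,(1 - \hat{p}) \approx 1.29$, a reference value against which overdispersion in the observed counts can be judged.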
For these data, we test the null hypothesis $H_0$: the data have exactly one mode versus the alternative hypothesis $H_1$: the data have more than one mode, obtaining an observed statistic equal to 0.031 with a p-value equal to 0.720. Consequently, at a significance level of 0.05, $H_0$ is not rejected; that is, the frequency distribution of the data is consistent with unimodal behavior.
A previous study with these data can be found in Paul [26].
Table 3 reports the results obtained by fitting the number of alphas using the BB, McGBB, and BBB distributions. The table shows that the BBB distribution presents the highest p-values in the chi-square goodness-of-fit test and the lowest AIC and BIC values, suggesting that the BBB distribution should be selected among the fitted distributions for modeling the number of alphas.
Figure 5 shows the frequency distribution of the number of alphas and the fitted BB, McGBB, and BBB distributions. In the figure, it can be seen that the frequency distribution of the number of alphas presents a single frequency peak and that the mass values of the BBB distribution are the closest to the empirical frequency values.
5. Final Comments
The beta-binomial (BB) and McDonald’s generalized beta-binomial (McGBB) distributions are discrete probability distributions used for modeling overdispersed binomial data. The McGBB distribution, which has three parameters and thus a larger parameter dimension than the BB distribution, can even model bimodality in cases where the empirical frequency distribution has a bathtub shape. In this article, we propose the bimodal beta-binomial (BBB) distribution as an alternative for modeling overdispersed binomial data, both unimodal and bimodal. The new distribution arises from a weighted version of the BB distribution, where the weight function has the quadratic form proposed by Elal-Olivero [13]. Consequently, the p.m.f. of the BBB distribution is flexible in shape: it can be monotonic, unimodal, or bimodal. Unlike that of the McGBB distribution, the bimodality of the BBB distribution is not limited to the bathtub shape and can be accompanied by various levels of skewness.
We derive the main structural functions of the BBB distribution, such as the p.m.f., the c.d.f., and the rth raw moment. We use the raw moments to describe the behavior of the coefficient of variation and the coefficient of skewness. We observe that the BBB distribution may exhibit a larger relative dispersion and a larger skewness than the BB distribution. We discuss parameter estimation via the maximum likelihood (ML) method. The estimators do not have closed form, so numerical methods are required to obtain the estimates. We conduct a simulation study to evaluate the behavior of the ML estimators, in which we observe that the ML method provides acceptable estimates. Finally, we illustrate the utility of the BBB distribution by fitting real data sets. The illustrations show that the BBB distribution can outperform the BB and McGBB distributions in modeling count data that are either unimodal or bimodal.