Article

A Bayesian Measure of Model Accuracy

by Gabriel Hideki Vatanabe Brunello † and Eduardo Yoshio Nakano *,†

Department of Statistics, University of Brasília, Campus Darcy Ribeiro, Asa Norte, Brasília 70910-900, Brazil
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2024, 26(6), 510; https://doi.org/10.3390/e26060510
Submission received: 1 April 2024 / Revised: 3 June 2024 / Accepted: 10 June 2024 / Published: 12 June 2024
(This article belongs to the Section Multidisciplinary Applications)

Abstract

Ensuring that the proposed probabilistic model accurately represents the problem is a critical step in statistical modeling, as choosing a poorly fitting model can have significant repercussions on the decision-making process. The primary objective of statistical modeling often revolves around predicting new observations, highlighting the importance of assessing the model’s accuracy. However, current methods for evaluating predictive ability typically involve model comparison, which may not guarantee a good model selection. This work presents an accuracy measure designed for evaluating a model’s predictive capability. This measure, which is straightforward and easy to understand, includes a decision criterion for model rejection. The development of this proposal adopts a Bayesian perspective of inference, elucidating the underlying concepts and outlining the necessary procedures for application. To illustrate its utility, the proposed methodology was applied to a real dataset, facilitating an assessment of its practicality in real-world scenarios.

1. Introduction

For effective decision making, a thorough understanding of the problem is crucial. However, this understanding often requires dealing with a significant amount of data, which, due to their volume, present complex relationships. In such scenarios, recognizing crucial data relationships may not be straightforward, necessitating the application of analytical methodologies. Statistical modeling stands as a valuable asset in this context, streamlining complex events through the lens of hypothetical probabilistic models. These models find validation through rigorous empirical observation, enhancing their reliability and utility. Therefore, it is essential to verify that the chosen model adequately represents the problem of interest. Failure to specify the model correctly can compromise the quality of information obtained, leading to inaccuracies, and ultimately, erroneous conclusions. Various methods exist to evaluate the quality of a model, but most involve subjective classification criteria or complex elaboration, deterring their use in practical applications. Hence, this work will introduce a proposal for a Bayesian methodology to evaluate the quality of a statistical model based on its predictive ability. This means assessing the model’s effectiveness in predicting values for new instances of the problem at hand. The advantage of this proposal lies in its simplicity. By focusing on the model’s predictive capacity, it does not solely rely on its fit to existing data. This approach streamlines its application and promotes its use in decision-making scenarios.
The proposal outlined in this work is a modification of an external validation approach proposed by [1], which lacks an objective criterion for assessing the model’s quality. Additionally, this methodology shares a similar logic to the Log Pseudo Marginal Likelihood (LPML) [2], but with distinct objectives. Whereas the LPML compares models, the aim of this work’s proposal is to determine whether a model can accurately predict a new observation. The behavior of the accuracy measure was examined through simulated applications in generalized linear models with exponential distribution.
The goals of this work were to introduce a proposal for a Bayesian methodology for assessing model adequacy based on its predictive ability, investigate the performance of this methodology in generalized linear models with exponential distribution, and devise a straightforward criterion for the methodology to assess the quality of a model. The proposed classification criterion was derived from simulated data and was demonstrated using a real dataset from the literature. All simulations and analyses were conducted using the open-source software R [3].

Assessment of the Quality of Statistical Models

The assessment of model quality is an area with extensive literature within the frequentist paradigm, with numerous techniques available for objective evaluation. For example, D’Agostino’s book [4] provides an overview of the most important classical goodness-of-fit tests. Conversely, within the Bayesian framework, the literature is relatively recent, and the existing methods are often restrictive or complex, resulting in this fundamental aspect of statistical analysis occasionally being neglected.
In a Bayesian context, evaluating the quality of a model does not rely on the adequacy of the likelihood function used, unlike the classical approach. Instead, it depends on the suitability of the posterior distribution, as any relevant inference for the problem stems from it. Additionally, some authors propose that the quality of a Bayesian model should be judged based on its predictive distribution. If the data do not align with the predictive distribution, it is anticipated that the model is not appropriate [5].
Furthermore, additional methods for evaluating the quality of a model in a Bayesian framework are discussed in [6]. Some of the commonly used techniques include Dirichlet Processes [7], posterior predictive check [1], Log Pseudo Marginal Likelihood [2], leave-one-out (LOO) cross-validation, and the Widely Applicable Information Criterion (WAIC) [8].
Dirichlet Processes [7] are utilized in estimating a non-parametric Bayesian model, which is subsequently compared to the proposed model using the Bayes Factor [9] to assess if their difference is significant. This method serves as a model fitting technique that employs the difference between the values estimated by the proposed model and those by the non-parametric model as a quality criterion. Nonetheless, it is associated with the drawback of necessitating a complex process for its development.
The posterior predictive check [1] assesses whether any T statistic from the model is consistent with the empirically observed data. This method entails a dual use of the data, as they are utilized both in the model estimation process and for comparison with the test statistic. The subjectivity in defining the T statistic is a notable critique of this approach, as it must be adapted to each specific problem.
The leave-one-out (LOO) cross-validation and the Widely Applicable Information Criterion (WAIC) [8] are methods that estimate pointwise out-of-sample prediction accuracy. According to Vehtari et al. [10], these methods have seen less use in practice because they involve additional computational steps. To mitigate this problem, they presented an optimized computation method for LOO using Pareto-smoothed importance sampling.
The Log Pseudo Marginal Likelihood (LPML) method [2] involves assessment using the Conditional Predictive Ordinate, which represents the predictive density of an observation in the estimated model without it. This approach selects models based on their predictive capacity, computing a statistic that indicates the optimal model to use. However, the obtained statistic does not enable us to determine whether the utilized model is a good fit; rather, it only indicates if it is superior to the others with which it was compared.
Gelman et al. [1] proposed another method based on external validation in which a predictive interval of probability 0.5 is computed for observations not utilized in the modeling process. This involves assessing the number of observations falling within these intervals, as it should closely align with the defined 50 % credibility ([1], p. 142). Despite its intuitive nature, this approach is not widely adopted due to its subjective rejection criterion, which can vary based on the user’s perspective on quality assessment. In this context, this work aims to adapt the external validation methodology proposed by [1], as it provides an intuitive approach to assessing the accuracy of a model. To achieve this goal, adjustments will be made to certain steps of the method, facilitating the establishment of an objective criterion for evaluating the model’s quality.

2. Proposal for Analysis of Predictive Capacity

This study proposes an adaptation of the external validation approach suggested by [1] to evaluate a model’s quality through the predictive capacity of its posterior distribution. The use of the posterior distribution ensures the suitability of the final model, as methods that solely assess the likelihood function may not ensure the appropriateness of the prior distribution used, potentially compromising the final model’s outcomes. The proposed method employs the leave-one-out (LOO) technique to assess the model’s ability to accurately predict new observations. The procedure consists of calculating the proportion of correctly predicted values and using it as a quality statistic: it checks whether the observed proportion is plausible at the chosen credible level and rejects the model when that proportion is unlikely. This idea is consistent with the emphasis on model prediction analysis common to both schools of thought of 20th-century statistics (see [11] for more details) and can be efficiently implemented in a great variety of statistical models.

2.1. The Accuracy Measure

Let $C_i$ be a credible interval for the predicted value from a model fitted without observation $i$ of the sample, where $i = 1, 2, \ldots, n$. If the value $y_i$ falls within the predicted interval, it is classified as a correct prediction ($u_i = 1$); otherwise, it is classified as an error ($u_i = 0$), i.e.,

$$u_i = \begin{cases} 1, & y_i \in C_i \\ 0, & y_i \notin C_i \end{cases}, \qquad i = 1, 2, \ldots, n. \tag{1}$$
Thus, the proportion of correct predictions is given by

$$\kappa = \frac{\sum_{i=1}^{n} u_i}{n}. \tag{2}$$
The LOO technique prevents the double use of data, unlike the posterior predictive check. Moreover, employing interval estimators simplifies specifying an expected proportion of accurate predictions for the model: for a $\gamma \times 100\%$ credible interval, this proportion should be close to $\gamma$, $0 < \gamma < 1$. Consequently, a $\kappa$ value far from $\gamma$ suggests the model lacks good predictive capacity and is not suitable for representing the problem. Therefore, the proposed accuracy measure is the difference between the proportion of accurate predictions and the credible level of the interval,

$$\Delta = \kappa - \gamma. \tag{3}$$

The value of $\Delta$ ranges from $-\gamma$ to $1 - \gamma$, and a value of $\Delta = 0$ indicates good model accuracy. It is important to note that a proportion of correct predictions significantly higher than the credibility used ($\Delta > 0$) is not beneficial, as it indicates imprecision in the predictive interval. Conversely, the more negative the value of $\Delta$, the stronger the indication that the model has low predictive capacity. The proposed method shares a similar rationale with the Log Pseudo Marginal Likelihood (LPML), but it serves different objectives. Whereas the LPML is used for model comparisons, the aim here is to determine if a model can effectively predict a new observation.

2.2. Decision Criterion

We can construct a hypothesis test for the methodology, providing an objective approach to determine whether there is evidence that the model used lacks good predictive capability for the given problem. Consider the following hypotheses:
$$H: \text{The model has good predictive capability.} \qquad H_a: \text{The model does not have good predictive capability.} \tag{4}$$
Hypothesis (4) can be tested using Bayesian hypothesis testing (see, for example, [12,13]) to determine whether the proportion of correct predictions, $\kappa$, is equal to the credibility $\gamma$ (i.e., $\Delta = 0$). Thus, hypothesis (4) can be reformulated as:

$$H: \kappa = \gamma \qquad \text{versus} \qquad H_a: \kappa \neq \gamma. \tag{5}$$

Assuming a prior distribution $\kappa \sim \mathrm{Beta}(a_1, a_2)$ and $u_i \mid \kappa \sim \mathrm{Bernoulli}(\kappa)$, the posterior distribution of the proportion of correct predictions, given the observations, is $\kappa \mid \boldsymbol{u} \sim \mathrm{Beta}(A_1, A_2)$, where $A_1 = a_1 + \sum_{i=1}^{n} u_i$ and $A_2 = a_2 + n - \sum_{i=1}^{n} u_i$. Here, $u_i$ ($i = 1, 2, \ldots, n$) is given by Equation (1). Moreover, hypothesis (5) can be tested using the evidence value (e-value) of the Full Bayesian Significance Test (FBST) [12]. The e-value for testing hypothesis (5) can be obtained through Monte Carlo simulation following the steps of Algorithm 1.
Algorithm 1: Obtaining the e-value to test hypothesis (5).
  1. Generate $\kappa_1, \kappa_2, \ldots, \kappa_M$ from a $\mathrm{Beta}(A_1, A_2)$ distribution with parameters $A_1 = a_1 + \sum_{i=1}^{n} u_i$ and $A_2 = a_2 + n - \sum_{i=1}^{n} u_i$;
  2. Calculate the posterior density under $H$: $f_H(\gamma) = \frac{1}{B(A_1, A_2)}\, \gamma^{A_1 - 1} (1 - \gamma)^{A_2 - 1}$;
  3. Calculate the posterior density at each draw: $f(\kappa_m) = \frac{1}{B(A_1, A_2)}\, \kappa_m^{A_1 - 1} (1 - \kappa_m)^{A_2 - 1}$, $m = 1, \ldots, M$;
  4. If $f(\kappa_m) \leq f_H(\gamma)$, set $v_m = 1$; otherwise, set $v_m = 0$, for $m = 1, \ldots, M$;
  5. Calculate the e-value: $\text{e-value} = \frac{1}{M} \sum_{m=1}^{M} v_m$.
Note: In Steps 2 and 3, $B(A_1, A_2) = \int_0^1 z^{A_1 - 1} (1 - z)^{A_2 - 1}\, dz$ is the beta function.
Below, we provide R code for obtaining the e-value to test hypothesis (5); the input values shown are illustrative (those of the example in Section 3.1).

 # Kappa_h = kappa value under H (the credible level gamma)
 # M       = number of Monte Carlo replicates
 # a, b    = hyperparameters of the Beta(a, b) prior
 # n       = sample size
 # u       = number of correct predictions (sum of the u_i)
 Kappa_h <- 0.5
 M       <- 1e6
 a       <- 1
 b       <- 1
 n       <- 100
 u       <- 47
 Kappa    <- rbeta(M, a + u, b + n - u)        # draws from the posterior Beta(A1, A2)
 f_post   <- dbeta(Kappa, a + u, b + n - u)    # posterior density at each draw
 f_post_h <- dbeta(Kappa_h, a + u, b + n - u)  # posterior density under H
 e.value  <- sum(f_post <= f_post_h) / M       # e-value of the FBST
 e.value
		
In this work, we opted for the level $\gamma = 0.5$, since it results in symmetry between the lower and upper deviations. Note that this symmetry does not hold for $\gamma \neq 0.5$: for $\gamma = 0.95$, the situation where the proportion of correct predictions is less than the credible level ($\Delta < 0$) is less concerning than when the proportion of correct predictions is greater than the credible level ($\Delta > 0$).
According to the FBST, hypothesis $H$ is rejected, meaning the proportion of correct predictions differs from 0.5 (or, equivalently, the model does not exhibit good predictive capability), if e-value $< \alpha$. Here, $\alpha$ is the “critical value”, either fixed or obtained from elicited loss functions [14].
Alternatively, according to the methodology outlined in this work, we reject the null hypothesis $H$ if $|\Delta_{obs}| > \Delta_{critical}$, where $\Delta_{critical}$ depends on the critical value $\alpha$ and the sample size $n$.
To establish the critical points of the rejection criterion, samples ranging from $n = 10$ to $500$ were generated. To determine critical points for other values of $n$, a least squares regression was fitted to the errors $\xi = |\Delta|$ using the square root of the sample size as the explanatory variable; note that the adopted value $\gamma = 0.5$ results in symmetry of the error $\xi$. This regression allows interpolation and extrapolation for $n > 40$. The regression model adopted was $\xi = \beta_1/\sqrt{n}$. The estimated parameters of the regression curves for $\alpha$ = 0.01, 0.05, 0.1, and 0.2 were, respectively, $\beta_1$ = 1.261, 0.966, 0.812, and 0.633. The values of $\Delta_{critical}$ for $\alpha$ = 0.01, 0.05, 0.1, and 0.2 were obtained from the FBST procedure, considering a $\mathrm{Beta}(1, 1)$ prior distribution and $M$ = 1,000,000 Monte Carlo replicates. Figure 1 displays the fitted curves of the errors against sample size for the different values of $\alpha$; the fits are satisfactory, indicating that the regression equations represent the errors well.
Table 1 presents the values of $\Delta_{critical}$ for $n = 10$ to $40$, as well as the approximation for $n > 40$.
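For a given sample size and significance level, the approximate critical value for $n > 40$ can be computed directly from the fitted coefficients above. A minimal R sketch (the function name and structure are ours; the coefficients are those reported above):

 # Approximate critical value of Delta for n > 40, using xi = beta1/sqrt(n)
 delta_critical <- function(n, alpha = 0.05) {
   beta1 <- c("0.01" = 1.261, "0.05" = 0.966, "0.1" = 0.812, "0.2" = 0.633)
   unname(beta1[as.character(alpha)] / sqrt(n))
 }
 delta_critical(100, 0.05)   # 0.0966, the value used in Section 3.1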

3. Two Simple Examples

3.1. Exponential Distribution

Consider, for example, the exponential distribution, widely used in fields such as health and reliability. This distribution was chosen for its single parameter, which simplifies the presentation of the proposed methodology. Let $X_1, X_2, \ldots, X_n$ be a sample from $X$, which follows an exponential distribution with mean $1/\theta$, i.e., $X \mid \theta \sim \mathrm{Exponential}(\theta)$. Assuming a priori $\theta \sim \mathrm{Gamma}(a, b)$, $a, b > 0$, we obtain the posterior distribution $\theta \mid X \sim \mathrm{Gamma}\left(a + n,\ b + \sum_{i=1}^{n} x_i\right)$. Thus, the predictive density function of a new observation $Y \mid X$ is given by:

$$f_{Y|X}(y) = \frac{(a+n)\left(b + \sum_{i=1}^{n} x_i\right)^{a+n}}{\left(y + b + \sum_{i=1}^{n} x_i\right)^{a+n+1}}, \quad y > 0. \tag{6}$$

Therefore, the quantile $q$ of $Y \mid X$ is

$$y_q = \left(b + \sum_{i=1}^{n} x_i\right)\left[(1-q)^{-\frac{1}{a+n}} - 1\right], \tag{7}$$

resulting in the following equal-tailed $\gamma \times 100\%$ credible interval for the predicted value $y$, given the sample $X_1, X_2, \ldots, X_n$:

$$CI_{\gamma \times 100\%}:\ \left[\left(b + \sum_{i=1}^{n} x_i\right)\left(\left(\tfrac{1+\gamma}{2}\right)^{-\frac{1}{a+n}} - 1\right);\ \left(b + \sum_{i=1}^{n} x_i\right)\left(\left(\tfrac{1-\gamma}{2}\right)^{-\frac{1}{a+n}} - 1\right)\right]. \tag{8}$$
The percentage of correct predictions and the proposed accuracy measure in this work can be obtained through the steps outlined in Algorithm 2.
Algorithm 2: Obtaining the accuracy measure $\Delta$.
  1. Set $i = 1$;
  2. Create sample $S_i$ by removing observation $i$ from the complete dataset;
  3. From $S_i$, obtain the credible interval $C_i$ for a new observation;
  4. Check whether observation $i$, removed from the sample, lies within the predicted interval:
    (a) if observation $i$ lies within the credible interval, set $u_i = 1$;
    (b) if observation $i$ does not lie within the credible interval, set $u_i = 0$;
  5. If $i < n$, set $i = i + 1$ and return to Step 2;
  6. Calculate the proportion of correct predictions, $\kappa$, using Equation (2);
  7. Calculate the accuracy measure, $\Delta$, using Equation (3).
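For the exponential model of this section, Algorithm 2 reduces to a few lines of R, since the credible interval in Step 3 is available in closed form through Equation (8). The sketch below is an illustration under assumed inputs (data vector x, prior hyperparameters a and b, credibility gam), not the authors’ original script:

 # Algorithm 2 for the exponential model, using the closed-form interval (8)
 a <- 0.01; b <- 0.01; gam <- 0.5     # diffuse Gamma prior and 50% credibility
 x <- rexp(100, rate = 2)             # illustrative data
 n <- length(x)
 u <- numeric(n)
 for (i in 1:n) {
   S <- b + sum(x[-i])                # posterior rate parameter without observation i
   A <- a + (n - 1)                   # posterior shape parameter without observation i
   lower <- S * (((1 + gam) / 2)^(-1 / A) - 1)
   upper <- S * (((1 - gam) / 2)^(-1 / A) - 1)
   u[i]  <- as.numeric(x[i] >= lower & x[i] <= upper)
 }
 kappa <- mean(u)                     # proportion of correct predictions, Equation (2)
 Delta <- kappa - gam                 # accuracy measure, Equation (3)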
In situations where obtaining the predictive distribution is not feasible, it can be numerically estimated using MCMC—Markov Chain Monte Carlo [15]. To obtain a numerical approximation of the credible interval mentioned in Step 3, the following procedure can be used:
  • For j = 1 , , J , generate θ [ j ] from the posterior distribution of θ | X .
  • For each value of θ [ j ] , generate y i [ j ] E x p o n e n t i a l ( θ [ j ] ) . Thus, y i [ 1 ] , y i [ 2 ] , , y i [ J ] is a sample from the predictive distribution (6).
  • The limits of the equal-tailed credible interval for a new observation y i are given by the quantiles γ 2 and (1 – γ 2 ) of y i [ 1 ] , y i [ 2 ] , , y i [ J ] . Alternatively, the HPD (highest posterior density) interval can be obtained from y i [ 1 ] , y i [ 2 ] , , y i [ J ] by using the e m p . h p d command from the T e a c h i n g D e m o s package in R [3].
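Continuing the sketch above, this Monte Carlo approximation of the predictive interval for observation $i$ could be written as follows (a sketch; emp.hpd requires the TeachingDemos package):

 # Monte Carlo approximation of the predictive interval for observation i
 J     <- 10000
 theta <- rgamma(J, shape = a + (n - 1), rate = b + sum(x[-i]))      # posterior draws
 y_rep <- rexp(J, rate = theta)                                      # predictive draws
 ci_et  <- quantile(y_rep, probs = c((1 - gam) / 2, (1 + gam) / 2))  # equal-tailed
 ci_hpd <- TeachingDemos::emp.hpd(y_rep, conf = gam)                 # HPD interval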
A drawback of the proposed method is its high computational cost, as it requires estimating a model for each observation in the sample, which can be inefficient for large datasets. In such situations, ref. [10] presented an optimized computation method for LOO using Pareto-smoothed importance sampling. This method effectively manages importance weights and is conveniently implemented in the loo package within the R programming environment [3].
Figure 2 depicts data generated from a sample of size $n = 100$ from an exponential distribution and its predictive intervals (HPD and equal-tailed) with 50% credibility ($\gamma = 0.5$) calculated from Equation (6). The analysis was performed considering a diffuse prior $\mathrm{Gamma}(a = 100^{-1}, b = 100^{-1})$ for $\theta$. In an asymmetric distribution, as is the case of Equation (6), the HPD and equal-tailed intervals cover distinct regions despite having the same credibility (Figure 2). The HPD interval does not contain the mean because of the credibility used and the asymmetry of the predictive distribution, as the HPD interval depends on the mode rather than the mean of the distribution. In this example, the exponential distribution exhibited a good predictive fit, with proportions of correct predictions of 47% and 53% for the 50% equal-tailed and HPD credible intervals, respectively. Note that both types of intervals resulted in $0.030 = |\Delta_{obs}| < \Delta_{critical} = 0.966/\sqrt{100} = 0.097$ (Table 1; $n = 100$; $\alpha = 0.05$), which leads to non-rejection of the hypothesis that the exponential model has good capacity to predict future data. For the observed accuracy rate $\kappa = 0.47$ (and $\kappa = 0.53$), the FBST yielded an e-value of 0.545, also leading to non-rejection of the hypothesis at $\alpha = 0.05$.

3.2. Poisson Distribution

Let $X_1, X_2, \ldots, X_n$ be a sample from $X$, which follows a Poisson distribution with mean $\theta$, i.e., $X \mid \theta \sim \mathrm{Poisson}(\theta)$. Assuming a priori $\theta \sim \mathrm{Gamma}(a, b)$, $a, b > 0$, we obtain the posterior distribution $\theta \mid X \sim \mathrm{Gamma}\left(a + \sum_{i=1}^{n} x_i,\ b + n\right)$. Thus, the predictive distribution of a new observation $Y \mid X$ is a Gamma–Poisson distribution with parameters $A = a + \sum_{i=1}^{n} x_i$ and $B = b + n$:

$$f_{Y|X}(y) = \frac{\Gamma\!\left(a + \sum_{i=1}^{n} x_i + y\right)}{\Gamma\!\left(a + \sum_{i=1}^{n} x_i\right)\Gamma(y+1)} \cdot \frac{(b+n)^{a + \sum_{i=1}^{n} x_i}}{(b+n+1)^{a + \sum_{i=1}^{n} x_i + y}}, \quad y = 0, 1, \ldots \tag{9}$$

Therefore, the lower and upper limits of the equal-tailed $\gamma \times 100\%$ credible interval for the predicted value $y$, given the sample $X_1, X_2, \ldots, X_n$, can be obtained, respectively, as $L_1 = \sup\{y : F_{Y|X}(y) \leq (1-\gamma)/2\}$ and $L_2 = \inf\{y : F_{Y|X}(y) \geq (1+\gamma)/2\}$, where $F_{Y|X}(y) = \sum_{k=0}^{y} f_{Y|X}(k)$ is the cumulative predictive distribution.
As an example, consider $Y \sim \mathrm{Gamma}\text{–}\mathrm{Poisson}(502, 30)$. In this case, the limits of the equal-tailed 50% credible interval are $L_1 = 13$ and $L_2 = 19$. Figure 3 presents the cumulative distribution function of $Y$.
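In R, the Gamma–Poisson predictive (9) corresponds to a negative binomial distribution with size $A$ and success probability $B/(B+1)$, so the limits can be obtained from its cumulative distribution function. A minimal sketch reproducing this example:

 # Equal-tailed 50% limits of the Gamma-Poisson(A = 502, B = 30) predictive
 A <- 502; B <- 30; gam <- 0.5
 y  <- 0:100                                       # support large enough for this example
 Fy <- pnbinom(y, size = A, prob = B / (B + 1))    # cumulative predictive distribution
 L1 <- max(y[Fy <= (1 - gam) / 2])                 # sup{y : F(y) <= 0.25} -> 13
 L2 <- min(y[Fy >= (1 + gam) / 2])                 # inf{y : F(y) >= 0.75} -> 19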
It is important to emphasize that since the predictive distribution is discrete, the interval may not have credibility exactly equal to $\gamma$; in fact, the real credibility of the interval will be greater than or equal to $\gamma$. Therefore, the test proposed in this paper is approximate in cases where the predictive distribution is discrete. An alternative in these cases is to use in hypothesis test (5) the average credibility of the $n$ intervals obtained in the LOO steps, as sketched below.
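A sketch of this correction for the Poisson model, assuming count data x, prior hyperparameters a and b, and credibility gam (hypothetical names, not the authors’ script): for each LOO step, compute the interval’s real credibility $F(L_2) - F(L_1 - 1)$ and average over $i$.

 # Average real credibility of the n LOO intervals (Poisson model)
 cred <- numeric(n)
 ys   <- 0:1000                                     # support truncated large enough
 for (i in 1:n) {
   A  <- a + sum(x[-i]); B <- b + (n - 1)           # posterior without observation i
   Fy <- pnbinom(ys, size = A, prob = B / (B + 1))  # cumulative predictive
   L1 <- max(ys[Fy <= (1 - gam) / 2])
   L2 <- min(ys[Fy >= (1 + gam) / 2])
   cred[i] <- pnbinom(L2, size = A, prob = B / (B + 1)) -
              pnbinom(L1 - 1, size = A, prob = B / (B + 1))
 }
 kappa_star <- mean(cred)   # average credibility used in hypothesis test (5)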
Figure 4 depicts data generated from a sample of size $n = 30$ from a negative binomial distribution and its predictive equal-tailed intervals with 50% credibility ($\gamma = 0.5$) estimated by a Poisson model. The analysis was performed considering a diffuse prior $\mathrm{Gamma}(a = 100^{-1}, b = 100^{-1})$ for $\theta$. As expected, the Poisson model exhibited a poor predictive fit, with a proportion of correct predictions of 30% for the 50% equal-tailed credible intervals. This observed proportion resulted in $0.2 = |\Delta_{obs}| > \Delta_{critical} = 0.183$ (Table 1; $n = 30$; $\alpha = 0.05$), which leads to rejection of the hypothesis that the Poisson model has good capacity to predict future data. For the observed accuracy rate $\kappa = 0.3$, the FBST yielded an e-value of 0.023, also leading to rejection of the Poisson model at $\alpha = 0.05$. In addition, the FBST of hypothesis (5) considering $\kappa^* = 0.632$ (the average credibility of the $n = 30$ intervals obtained in the LOO steps) yielded an e-value < 0.001, also leading to rejection of the Poisson model at $\alpha = 0.05$.

4. Simulation Study

In this section, we present a simulation study to verify whether factors such as the nature of the covariates (numeric or categorical), the number of model parameters, or the sample size used for estimation could influence the behavior of the proportion of correct predictions, $\kappa$, making it crucial to investigate their effects when determining the critical value. To assess which of these factors truly impact the value of $\kappa$, simulation studies were conducted using exponential regression models. To examine the effects of possible interactions among the factors, samples of size $n$ were simulated, considering four scenarios of parameters with one to five predictors each, resulting in twenty distinct scenarios. For each of these scenarios, 1000 samples were generated, totaling 20,000 samples, and the methodology was applied to each generated sample. The simulated sample sizes ranged from $n = 10$ to $40$, and then $n = 50, 60, \ldots, 150$. Given the ease of obtaining and interpreting results, this study uses only the equal-tailed interval to define the value of $\kappa$; all simulations were therefore conducted with equal-tailed intervals. The flowchart depicted in Figure 5 illustrates the structure used in the simulation, along with the parameter scenarios utilized.
The scenarios were chosen to maximize the diversity of parameter values and covariates used. Below are the four scenarios considered:
$S_1$: $X^T\beta = 0.7 + 1.1X_1 - 0.6X_2 + 0.2X_3 + X_4 - 1.5X_5$
$S_2$: $X^T\beta = 1 - 1.3X_1 + 0.4X_2 - 0.2X_3 + 0.9X_4 - 0.3X_5$
$S_3$: $X^T\beta = 0.3 + 0.7X_1 - 1.2X_2 + 1.1X_3 - 0.7X_4 + X_5$
$S_4$: $X^T\beta = 1.7 - 0.8X_1 + 0.1X_2 + 0.6X_3 - 0.8X_4 - 1.1X_5$
In these scenarios, $X_1 \sim \mathrm{Uniform}(a = 0, b = 5)$, $X_2 \sim \mathrm{Bernoulli}(p = 0.2)$, $X_3 \sim \mathrm{Normal}(\mu = 0, \sigma^2 = 1)$, $X_4 \sim \mathrm{Bernoulli}(p = 0.7)$, and $X_5 \sim \mathrm{Uniform}(a = 0, b = 5)$.
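As an illustration, one sample from scenario $S_1$ with all five covariates could be generated as follows; this is a sketch under the assumption (as in the model of Section 5) that the exponential rate is $\exp(X^T\beta)$:

 # Generating one sample of size n from scenario S1
 n  <- 100
 X1 <- runif(n, 0, 5)
 X2 <- rbinom(n, 1, 0.2)
 X3 <- rnorm(n, 0, 1)
 X4 <- rbinom(n, 1, 0.7)
 X5 <- runif(n, 0, 5)
 eta <- 0.7 + 1.1 * X1 - 0.6 * X2 + 0.2 * X3 + X4 - 1.5 * X5
 y   <- rexp(n, rate = exp(eta))   # exponential response with rate exp(X'beta)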
In Figure 6 and Figure 7, we can observe the mean and standard deviation of the $\kappa$ values over the $M$ = 20,000 Monte Carlo replicates across the four considered scenarios. It is noticeable that for small sample sizes there were significant disparities among models with varying numbers of covariates: simulations with more covariates exhibited higher means and standard deviations than the others. This outcome is expected due to model saturation with small samples; attempting to estimate numerous parameters with limited observations results in lower predictive capacity of the fitted model. However, as the sample size increases, the differences due to the number of covariates diminish, and all models converge to the same value in terms of both mean and standard deviation. It is evident that when the model has few covariates, roughly fewer than 20% of the sample size, the value of $\kappa$ is not affected by this factor.
The various scenarios of parameters used did not affect the value of κ , as it showed well-distributed values across the simulations. This suggests that the types of covariates do not significantly influence the model’s accuracy percentage. Another expected outcome is the convergence of the standard deviation of the proportion of correct predictions to zero, with the mean converging to 0.5. This occurs because as the sample size increases, there is a greater concentration of correct predictions around the chosen credible level.
To assess whether all simulated proportions of correct predictions exhibited symmetrical behavior, the skewness coefficient was calculated for the number of covariates, type of combination, and sample size. Figure 8 presents the results of the skewness coefficients calculated for the simulations. It can be observed that all values cluster close to 0, indicating evidence of symmetry in all simulated $\kappa$ values. The slight fluctuations around zero result from the finite number of simulations conducted in each scenario; nonetheless, skewness values between −0.25 and 0.25 are very close to symmetrical behavior, and the distributions can be treated as symmetric without sacrificing accuracy.
Based on the results of the cross-simulations, it was determined that only the sample size has an impact on the proportion of correct predictions, $\kappa$. As a result, only the sample size, $n$, will be used to establish the model rejection criterion.
Figure 9 displays the average and standard deviation of the simulated accuracy proportions, κ , based on the sample size. Each data point on the graph represents 20,000 simulations, enhancing the precision of the estimates.
The average values of $\kappa$ are consistently centered around 0.5, reflecting the chosen credible level. Additionally, as the sample size increases, the standard deviation tends to zero, consistent with the previous results. Skewness was also calculated for the proportion of correct predictions aggregated solely by sample size, with the results displayed in Figure 10. These coefficients are very close to 0, indicating symmetric distributions.
The findings from this simulation study indicate that only the sample size factor needs to be considered in formulating the rejection criterion, showing that the critical points presented in Table 1 are valid regardless of the type and number of explanatory variables in the model.

5. Illustrative Example

The Leukemia dataset, as presented by [16], contains information on the time of death (in weeks) and the white blood count for two groups of leukemia patients, totaling 33 observations. The data are presented in Table 2.
In this application, the proposed model was an exponential regression with parameter $\theta = e^{\beta_0 + \beta_1 \mathrm{WBC} + \beta_2 \mathrm{AG}}$, where WBC represents the white blood cell count (in units of 10,000) and AG denotes the presence of Auer rods and/or significant granulation of the leukemic cells in the bone marrow at the time of diagnosis (AG present = 1, AG absent = 0). The proposed methodology was applied to the Leukemia dataset using a diffuse prior $N(\mu = 0, \sigma^2 = 100^2)$ for $\beta_0$, $\beta_1$, and $\beta_2$. The MCMC process generated 1,000,000 samples with a thinning interval of 5 and a burn-in of 10,000. The results of the LOO technique for assessing predictive capacity are presented in Table 3 and Figure 11.
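The posterior sampling itself can be carried out with any MCMC scheme. Below is a minimal random-walk Metropolis sketch for this exponential regression under the diffuse normal priors; the proposal scale, number of draws, and the vectors y, WBC, and AG (from Table 2) are assumptions of the illustration, not the authors’ implementation:

 # Random-walk Metropolis sketch for the exponential regression posterior
 log_post <- function(beta, y, X) {
   sum(dexp(y, rate = exp(X %*% beta), log = TRUE)) +      # exponential likelihood
     sum(dnorm(beta, mean = 0, sd = 100, log = TRUE))      # diffuse N(0, 100^2) priors
 }
 X     <- cbind(1, WBC, AG)                # design matrix built from Table 2
 beta  <- c(0, 0, 0)                       # starting values
 draws <- matrix(NA, nrow = 5000, ncol = 3)
 for (m in 1:5000) {
   prop <- beta + rnorm(3, 0, 0.1)         # symmetric random-walk proposal
   if (log(runif(1)) < log_post(prop, y, X) - log_post(beta, y, X)) beta <- prop
   draws[m, ] <- beta                      # store the current state of the chain
 }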
From Figure 11, it is evident that the predictive capacity for individuals with AG present was unsatisfactory, as a significant number of points lie outside the 50% credible intervals, indicating a potentially poor fit of the model to the data. In this application, the observed accuracy rate was $\kappa = 11/33 = 0.333$ (Table 3), resulting in $\Delta_{obs} = 0.333 - 0.5 = -0.167$. Referring to Table 1 for $n = 33$, we find $\Delta_{critical} = 0.152$ for $\alpha = 0.05$. Thus, with 95% credibility, we reject the hypothesis that the exponential model has good predictive capacity for this problem, since $|\Delta_{obs}| > \Delta_{critical}$.
For the observed accuracy rate $\kappa = 0.333$ (and $n = 33$), the FBST yielded an e-value of 0.048, also leading to rejection of the hypothesis at $\alpha = 0.05$. It is noteworthy that both decision criteria reject the hypothesis, demonstrating the suitability of the decision criterion based on the critical values of $\Delta$ presented in Table 1.
It is important to mention that the choice of prior distribution directly impacts the posterior distribution, and its misspecification can result in low model accuracy. As an example, consider changing the prior distribution of $\beta_0$ to an informative prior $N(\mu = 0, \sigma^2 = 1)$ while keeping the same diffuse prior $N(\mu = 0, \sigma^2 = 100^2)$ for $\beta_1$ and $\beta_2$. With this change, we obtain an observed accuracy rate of $\kappa = 10/33 = 0.303$, resulting in $\Delta_{obs} = -0.197$. This result indicates that the proposed accuracy measure identified the loss of accuracy due to misspecification of the prior distribution.
In addition, a measure commonly used to assess the accuracy of a model is the Root Mean Squared Error (RMSE), defined by $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}$. Here, $\hat{Y}_i$ is the point estimate of the predicted value of individual $i$, $i = 1, 2, \ldots, n$, defined as the mean of the predictive distribution. When considering the diffuse prior $N(\mu = 0, \sigma^2 = 100^2)$ for $\beta_0$, $\beta_1$, and $\beta_2$ (results in Table 3), we obtain $\mathrm{RMSE} = 40.290$. However, when the prior distribution of $\beta_0$ is replaced by the misspecified informative prior $N(\mu = 0, \sigma^2 = 1)$, the value increases to $\mathrm{RMSE} = 122.303$. Comparing the two models by RMSE alone, the first model is clearly more accurate; however, the RMSE fails to identify that even this “more accurate” model does not have good predictive capacity for the data in Table 2, as shown by the accuracy measure proposed in this paper.
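In R, the RMSE is a one-liner; y_hat below stands for the vector of predictive means (a hypothetical name for illustration):

 # RMSE from observed values y and predictive means y_hat
 rmse <- sqrt(mean((y - y_hat)^2))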

6. Discussion

This study presented an adaptation of a methodology based on external validation proposed by [1], which, despite its simplicity and intuitiveness, lacked an objective way to validate models. The adaptation enabled the definition of an accuracy measure together with a model rejection criterion, providing an objective way to validate models; previously, the classification could vary depending on the researcher’s perspective. The development of this proposal was carried out from a Bayesian perspective of inference, elucidating the concepts used in its formulation and outlining the necessary steps for its application.
The decision criterion was defined from the FBST (Full Bayesian Significance Test) procedures. The conducted simulation study based on generalized linear models with an exponential distribution indicated that the proposed accuracy measure depends solely on the sample size.
The application of this methodology to real data allowed us to confirm its ease of use and its ability to identify when a model lacks good predictive capability. With its intuitive simplicity, ease of implementation, and low complexity, it is believed that the methodology proposed in this study will become an attractive alternative for model evaluation. This may encourage more research in this field, given the promising results obtained.

Author Contributions

Conceptualization, G.H.V.B. and E.Y.N.; methodology, G.H.V.B. and E.Y.N.; software, G.H.V.B. and E.Y.N.; validation, G.H.V.B. and E.Y.N.; formal analysis, G.H.V.B. and E.Y.N.; investigation, G.H.V.B. and E.Y.N.; resources, G.H.V.B. and E.Y.N.; data curation, G.H.V.B. and E.Y.N.; writing—original draft preparation, G.H.V.B. and E.Y.N.; writing—review and editing, G.H.V.B. and E.Y.N.; visualization, G.H.V.B. and E.Y.N.; supervision, E.Y.N.; project administration, E.Y.N.; funding acquisition, E.Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brazil (CAPES)—Finance Code 001, Fundação de Apoio à Pesquisa do Distrito Federal (FAPDF)—TOA 531/2023, Editais de Auxílio Financeiro DPI/DPG/UnB, DPI/DPG/BCE/UnB and PPGEST/UnB.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are from [16] and are also included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CI    Credible Interval
FBST  Full Bayesian Significance Test
HPD   Highest Posterior Density
LOO   Leave-One-Out
LPML  Log Pseudo Marginal Likelihood
MCMC  Markov Chain Monte Carlo
RMSE  Root Mean Squared Error
WAIC  Widely Applicable Information Criterion

References

  1. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis, 3rd ed.; CRC Press: New York, NY, USA, 2014.
  2. Chen, M.-H.; Shao, Q.-M.; Ibrahim, J.G. Bayesian variable selection and computation for generalized linear models with conjugate priors. Bayesian Anal. 2008, 3, 585–614.
  3. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023. Available online: https://www.R-project.org/ (accessed on 1 March 2024).
  4. D’Agostino, R.B. Goodness-of-Fit Techniques; CRC Press: Boca Raton, FL, USA, 1986; Volume 68.
  5. Paulino, C.D.M.; Turkman, M.A.A.; Murteira, B. Estatística Bayesiana; Fundação Calouste Gulbenkian: Lisboa, Portugal, 2003.
  6. Vehtari, A.; Ojanen, J. A survey of Bayesian predictive methods for model assessment, selection and comparison. Stat. Surv. 2012, 6, 142–228.
  7. Ferguson, T.S. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973, 1, 209–230.
  8. Watanabe, S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 2010, 11, 3571–3594.
  9. Kass, R.E.; Raftery, A.E. Bayes factors. J. Am. Stat. Assoc. 1995, 90, 773–795.
  10. Vehtari, A.; Gelman, A.; Gabry, J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput. 2017, 27, 1413–1432.
  11. Stern, J.M. Karl Pearson and the logic of science: Renouncing causal understanding (the bride) and inverted Spinozism. S. Am. J. Log. 2018, 4, 219–252.
  12. Pereira, C.A.B.; Stern, J.M. Evidence and credibility: Full Bayesian significance test for precise hypotheses. Entropy 1999, 1, 99–110.
  13. Pereira, C.A.B.; Nakano, E.Y.; Fossaluza, V.; Esteves, L.G.; Gannon, M.A.; Polpo, A. Hypothesis tests for Bernoulli experiments: Ordering the sample space by Bayes factors and using adaptive significance levels for decisions. Entropy 2017, 19, 696.
  14. Pereira, C.A.B.; Stern, J.M.; Wechsler, S. Can a significance test be genuinely Bayesian? Bayesian Anal. 2008, 3, 79–100.
  15. Hammersley, J.; Handscomb, D. Monte Carlo Methods, 1st ed.; Fletcher & Son: Great Ayton, UK, 1964.
  16. Hand, D.J.; Daly, F.; Lunn, A.D.; McConway, K.J.; Ostrowski, E. A Handbook of Small Data Sets, 1st ed.; CRC Press: New York, NY, USA, 1994.
Figure 1. Regression curves of the error $\xi$.
Figure 2. Observed values and 50% credible intervals of a new predicted observation in the exponential model.
Figure 3. Cumulative distribution function of the Gamma–Poisson distribution with the lower and upper limits of the 50% credible interval. The real credibility of the interval is 60.2%.
Figure 4. Observed values and 50% credible intervals (region in red) of a new predicted observation in the Poisson model. The average credibility of the intervals is 63.2%.
Figure 5. Simulation structure.
Figure 6. Average proportion of correct predictions, $\kappa$, by scenario and number of covariates.
Figure 7. Standard deviation of the proportion of correct predictions, $\kappa$, by scenario and number of covariates.
Figure 8. Skewness of $\kappa$ by number of covariates.
Figure 9. Average and standard deviation of $\kappa$ by sample size.
Figure 10. Skewness of $\kappa$ by sample size.
Figure 11. The 50% credible intervals obtained by the LOO technique.
Table 1. Critical values of $\Delta$ for $\gamma = 0.5$.

  n     α = 0.01   α = 0.05   α = 0.10   α = 0.20
  10    0.350      0.250      0.250      0.150
  11    0.364      0.273      0.273      0.182
  12    0.375      0.292      0.208      0.208
  13    0.308      0.231      0.231      0.154
  14    0.321      0.250      0.179      0.179
  15    0.333      0.267      0.200      0.133
  16    0.281      0.219      0.219      0.156
  17    0.294      0.235      0.176      0.176
  18    0.306      0.194      0.194      0.139
  19    0.263      0.211      0.158      0.158
  20    0.275      0.225      0.175      0.125
  21    0.286      0.190      0.190      0.143
  22    0.250      0.205      0.159      0.114
  23    0.261      0.217      0.174      0.130
  24    0.229      0.188      0.146      0.146
  25    0.240      0.200      0.160      0.120
  26    0.250      0.173      0.173      0.135
  27    0.222      0.185      0.148      0.111
  28    0.232      0.196      0.161      0.125
  29    0.241      0.172      0.138      0.103
  30    0.217      0.183      0.150      0.117
  31    0.226      0.161      0.129      0.097
  32    0.234      0.172      0.141      0.109
  33    0.212      0.152      0.152      0.121
  34    0.221      0.162      0.132      0.103
  35    0.200      0.171      0.143      0.114
  36    0.208      0.153      0.125      0.097
  37    0.216      0.162      0.135      0.108
  38    0.197      0.145      0.118      0.092
  39    0.205      0.154      0.128      0.103
  40    0.188      0.163      0.138      0.088
  n > 40 (approx.)   1.261/√n   0.966/√n   0.812/√n   0.633/√n

Decision criterion: reject $H$ if $|\Delta_{obs}| > \Delta_{critical}$.
Table 2. Leukemia data.

  AG Present (1)        AG Absent (0)
  WBC     Time          WBC     Time
  0.23    65            0.44    56
  0.075   156           0.3     65
  0.43    100           0.4     17
  0.26    134           0.15    7
  0.6     16            0.9     16
  1.05    108           0.53    22
  1       121           1       3
  1.7     4             1.9     4
  0.54    39            2.7     2
  0.7     143           2.8     3
  0.94    56            3.1     8
  3.2     26            2.6     4
  3.5     22            2.1     3
  10      1             7.9     30
  10      1             10      4
  5.2     5             10      43
  10      65

Source: Hand et al. [16].
Table 3. Results of the leave-one-out (LOO) technique for the data in Table 2 (exponential regression model).

  Patient   Y     AG   WBC      β0      β1      β2       Ŷ        Lower 50% CI   Upper 50% CI   Y in interval
  1         65    1    0.230    3.162   1.117   −0.064   73.713   19.931         99.966         Yes
  2         156   1    0.075    3.138   1.044   −0.058   67.763   18.175         91.789         No
  3         100   1    0.430    3.155   1.087   −0.063   70.059   19.012         94.907         No
  4         134   1    0.260    3.145   1.059   −0.060   68.618   18.458         92.888         No
  5         16    1    0.600    3.170   1.155   −0.066   75.156   20.362         102.077        No
  6         108   1    1.050    3.156   1.074   −0.063   66.711   18.200         90.528         No
  7         121   1    1.000    3.153   1.063   −0.062   65.988   17.922         89.522         No
  8         4     1    1.700    3.161   1.171   −0.064   70.458   19.191         95.688         No
  9         39    1    0.540    3.167   1.137   −0.066   73.575   20.010         99.818         Yes
  10        143   1    0.700    3.147   1.046   −0.061   65.681   17.699         89.356         No
  11        56    1    0.940    3.162   1.123   −0.064   70.542   19.162         95.804         Yes
  12        26    1    3.200    3.154   1.152   −0.062   62.686   16.996         85.112         Yes
  13        22    1    3.500    3.150   1.158   −0.062   62.087   16.709         84.311         Yes
  14        1     1    10.000   3.087   1.214   −0.044   53.982   12.596         68.882         No
  15        1     1    10.000   3.087   1.214   −0.044   54.119   12.587         68.914         No
  16        5     1    5.200    3.131   1.185   −0.057   58.813   15.400         79.113         No
  17        65    1    10.000   3.286   0.952   −0.091   33.086   7.168          41.304         No
  18        56    0    0.440    2.983   1.263   −0.049   20.481   5.357          27.418         No
  19        65    0    0.300    2.926   1.311   −0.043   19.621   5.124          26.274         No
  20        17    0    0.400    3.187   1.090   −0.066   24.806   6.495          33.090         Yes
  21        7     0    0.150    3.235   1.050   −0.071   26.568   6.941          35.657         Yes
  22        16    0    0.900    3.189   1.089   −0.066   24.108   6.365          32.389         Yes
  23        22    0    0.530    3.163   1.111   −0.064   24.115   6.327          32.342         Yes
  24        3     0    1.000    3.244   1.040   −0.070   25.059   6.650          33.612         No
  25        4     0    1.900    3.232   1.049   −0.068   23.231   6.242          31.377         No
  26        2     0    2.700    3.233   1.045   −0.067   21.990   5.899          29.818         No
  27        3     0    2.800    3.228   1.050   −0.067   21.785   5.863          29.525         No
  28        8     0    3.100    3.206   1.069   −0.066   20.899   5.650          28.400         Yes
  29        4     0    2.600    3.226   1.052   −0.067   21.959   5.923          29.687         No
  30        3     0    2.100    3.235   1.045   −0.068   22.961   6.162          31.028         No
  31        30    0    7.900    3.120   1.176   −0.076   13.203   3.442          17.634         No
  32        4     0    10.000   3.167   1.088   −0.054   15.160   3.758          19.809         Yes
  33        43    0    10.000   3.112   1.286   −0.122   7.357    1.768          9.539          No
  Full model            3.161   1.112   −0.064   RMSE = 40.290