Article

Modified Chi-Squared Goodness-of-Fit Tests for Continuous Right-Skewed Response Generalized Linear Models

by Vilijandas Bagdonavičius *,† and Rūta Levulienė †
Institute of Applied Mathematics, Vilnius University, 03225 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2025, 13(16), 2659; https://doi.org/10.3390/math13162659
Submission received: 20 June 2025 / Revised: 9 August 2025 / Accepted: 14 August 2025 / Published: 18 August 2025
(This article belongs to the Special Issue Computational Statistics and Data Analysis, 2nd Edition)

Abstract

Generalized linear models are applied for data analysis in many areas. One of the most important steps in fitting a model is checking its goodness-of-fit; however, there is a lack of such tests. Modified chi-squared goodness-of-fit tests for generalized linear models are constructed. Models with continuous right-skewed, possibly censored responses are considered. Explicit formulas for the test statistics are provided in the case of the gamma and inverse Gaussian models. The test power is investigated by simulation. Real data examples illustrate the application of the tests.

1. Introduction

Generalized linear models (GLMs) [1] are among the most commonly used regression models in practice. The most frequently applied continuous GLMs are the normal (Gaussian), gamma, and inverse Gaussian models. The gamma and inverse Gaussian regression models are used to model right-skewed response variables, for example, in modeling the lifetime distribution in reliability theory [2,3], claims prediction and premium computation in insurance [4,5,6], healthcare cost analysis [7,8], and estimation of outcomes in psychology [9].
The Gaussian GLM coincides with the most widely applied normal regression model, and the theory of normal regression can be found in an enormous number of articles and books on theoretical and applied statistics.
The fitting of a regression model consists of a set of steps, one of the key steps being to check the goodness-of-fit (GOF). However, there is a lack of such tests, especially for continuous GLMs such as gamma and inverse Gaussian regression; only a few articles consider formal tests.
In many textbooks, the chi-squared approximation of the Pearson and deviance statistics is recommended for testing the fit of the gamma and inverse Gaussian regression models. This can lead to erroneous conclusions, because the approximation is valid only if the shape parameter is large, as clearly demonstrated in [10,11], for example. In [11], the authors propose approximations of the quantiles of the Pearson and deviance statistics for the gamma regression model. Unfortunately, these approximations are given only in the case of a known shape parameter ν. The case of unknown ν is not investigated, so these results cannot be used for goodness-of-fit testing. In [10], GOF tests for gamma and inverse Gaussian models are proposed by applying modifications of the Cramér–von Mises and Anderson–Darling statistics. These statistics are computed using transformations of the responses via parametric estimates of their cumulative distribution functions (c.d.f.) and the inverse of the c.d.f. of the standard normal distribution. The theory is not developed rigorously: the asymptotic distributions of the test statistics are not found, and approximations of the distributions of the test statistics for finite sample sizes are not given. These tests cannot be applied if the data are censored.
The score test for inverse Gaussian regression against an inverse Gaussian mixture was constructed in [12]. The authors considered cases with complete and censored data, and critical values were obtained using the bootstrap. The disadvantage is that this test is not an omnibus test; it is recommended only when a single specific alternative (the mixture) is suspected.
The current paper is a natural continuation of our paper [13], in which modified chi-squared tests were constructed for parametric accelerated failure time (AFT) models (see also [14]). To obtain these tests, some asymptotic results were first rigorously derived for general parametric regression models (the AFT models being particular cases of these models). In particular, the asymptotic properties of the random vector of differences between the numbers of observed and “expected” failures in the intervals of a data partition (the partition is obtained using a uniquely defined rule) were derived. The application of the general results to the following AFT models was considered: exponential and shape-scale (Weibull, log-normal, and log-logistic).
In the current article, we apply the general theorems of our paper [13] to obtain modified chi-squared goodness-of-fit tests for continuous right-skewed, possibly censored GLMs.
The inverse Gaussian regression model is a GLM but not an AFT model; thus, new tests are needed. The gamma regression model is both a GLM and an AFT model; however, the article [13] on GOF tests for AFT models did not consider it, so we investigate it here.
Tests for the gamma and inverse Gaussian regression models are investigated in detail. The Gaussian GLM is transferred by the exponential transformation to the log-normal AFT model, which is considered in [13]. We do not write out the formulas for this model, because GOF tests for normal regression are well known and have been investigated in many papers.
Some authors consider diagnostic plots based on residuals for the gamma and inverse Gaussian models, but these are not formal GOF tests, so they cannot be compared with the proposed tests because their significance and power cannot be investigated. However, diagnostic graphs are useful at the initial stage of analysis and, in conjunction with formal GOF tests, provide a broader view of the data. The authors of [15] proposed two new methods for the detection of influential observations in the case of inverse Gaussian regression and also presented a review of existing methods. In [16], adjusted deviance residuals for the gamma regression model were proposed and used for influence diagnostics. The construction of partial residuals for inverse Gaussian regression was carried out in [17] for graphical model diagnostics.
The structure of the article is as follows: first (Section 2), the continuous GLMs are discussed; in Section 3, the methodology of the modified chi-squared test is provided, the approach for choosing the grouping intervals is explained, and the limit distribution of the test statistic is obtained. The results of the simulation study and applications to real data are presented in Section 4 and Section 5, respectively.

2. Gamma and Inverse Gaussian Regression Models

Let us consider the parametrization of the gamma distribution, denoted by $\Gamma(\nu,\mu)$, $\nu>0$, $\mu>0$, with the following probability density function (p.d.f.):
$$f(t,\nu,\mu)=\frac{\nu^{\nu}}{\mu^{\nu}\Gamma(\nu)}\,t^{\nu-1}\exp\{-(\nu/\mu)t\},\quad t>0,$$
where $\nu$ is the shape parameter.
If $T$ is a random variable with distribution $\Gamma(\nu,\mu)$, then the mean and the variance are
$$\mathrm{E}(T)=\mu,\qquad \mathrm{Var}(T)=\mu^{2}/\nu,$$
and the cumulative distribution function (c.d.f.) is
$$F(t,\nu,\mu)=F_{\chi^{2}_{2\nu}}(2\nu t/\mu)=\frac{1}{\Gamma(\nu)}\,\gamma\!\left(\nu,\frac{\nu t}{\mu}\right),$$
where $F_{\chi^{2}_{2\nu}}$ is the c.d.f. of the chi-squared distribution with $2\nu$ degrees of freedom and $\gamma(s,x)=\int_{0}^{x}u^{s-1}e^{-u}\,du$ is the lower incomplete gamma function.
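The equality of the three expressions for the c.d.f. can be verified numerically; the sketch below (ours, not from the article) uses SciPy, where `scipy.special.gammainc` is the regularized lower incomplete gamma function $\gamma(s,x)/\Gamma(s)$.

```python
# Numerical check of F(t; nu, mu) = F_{chi^2_{2nu}}(2 nu t / mu)
#                                 = gamma(nu, nu t / mu) / Gamma(nu).
from scipy.stats import chi2, gamma
from scipy.special import gammainc

nu, mu, t = 2.5, 4.0, 3.0

# Gamma(nu, mu) parametrized by its mean mu: shape nu, scale mu / nu.
lhs = gamma.cdf(t, a=nu, scale=mu / nu)
via_chi2 = chi2.cdf(2 * nu * t / mu, df=2 * nu)
via_inc_gamma = gammainc(nu, nu * t / mu)  # regularized lower incomplete gamma

assert abs(lhs - via_chi2) < 1e-10
assert abs(lhs - via_inc_gamma) < 1e-10
```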
Let us consider the parametrization of the inverse Gaussian distribution (also known as the Wald distribution), denoted by $IG(\nu,\mu)$, $\nu>0$, $\mu>0$, with the following p.d.f.:
$$f(t,\nu,\mu)=\sqrt{\frac{\nu}{2\pi t^{3}}}\exp\left\{-\frac{\nu(t-\mu)^{2}}{2\mu^{2}t}\right\},\quad t>0,$$
where $\nu$ is the shape parameter. If $T$ is a random variable with distribution $IG(\nu,\mu)$, then the mean and the variance are
$$\mathrm{E}(T)=\mu,\qquad \mathrm{Var}(T)=\mu^{3}/\nu,$$
and the c.d.f. is
$$F(t,\nu,\mu)=\Phi\left(\sqrt{\frac{\nu}{t}}\left(\frac{t}{\mu}-1\right)\right)+e^{2\nu/\mu}\,\Phi\left(-\sqrt{\frac{\nu}{t}}\left(\frac{t}{\mu}+1\right)\right),\quad t>0,$$
where $\Phi$ is the c.d.f. of the standard normal distribution.
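This closed form can be checked against `scipy.stats.invgauss` (a sketch, ours); in SciPy's parametrization, $IG(\nu,\mu)$ with mean $\mu$ and shape $\nu$ corresponds to `invgauss(mu=mu/nu, scale=nu)`.

```python
# Check the closed-form IG(nu, mu) c.d.f. against scipy.stats.invgauss.
import numpy as np
from scipy.stats import invgauss, norm

nu, mu = 3.0, 2.0
for t in [0.5, 1.0, 2.0, 5.0]:
    closed_form = (norm.cdf(np.sqrt(nu / t) * (t / mu - 1))
                   + np.exp(2 * nu / mu) * norm.cdf(-np.sqrt(nu / t) * (t / mu + 1)))
    assert abs(closed_form - invgauss.cdf(t, mu=mu / nu, scale=nu)) < 1e-9
```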
The gamma and the IG distributions belong to the exponential family with a p.d.f. of the following form:
$$f(t,\theta,\phi)=\exp\left\{\frac{\theta t-b(\theta)}{a(\phi)}+c(t,\phi)\right\},\quad t>0.$$
For the gamma distribution,
$$\theta=-\frac{1}{\mu},\quad \phi=1/\nu,\quad b(\theta)=-\ln(-\theta),\quad a(\phi)=\phi,$$
$$c(t,\phi)=\phi^{-1}\ln(t/\phi)-\ln t-\ln\Gamma(\phi^{-1}),$$
and for the IG distribution,
$$\theta=-\frac{1}{2\mu^{2}},\quad \phi=1/\nu,\quad b(\theta)=-\sqrt{-2\theta},\quad a(\phi)=\phi,\quad c(t,\phi)=\frac{1}{2}\ln\frac{\phi^{-1}}{2\pi t^{3}}-\frac{\phi^{-1}}{2t}.$$
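As a quick numerical check (ours), the gamma density can be recovered from this exponential-family form with the $\theta$, $\phi$, $b$, $a$, and $c$ given above.

```python
# Verify that exp{(theta*t - b(theta))/a(phi) + c(t, phi)} reproduces
# the Gamma(nu, mu) density for the stated theta, phi, b, a, c.
import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma

nu, mu, t = 2.5, 4.0, 3.0
theta, phi = -1.0 / mu, 1.0 / nu
b = -np.log(-theta)                                  # b(theta) = -ln(-theta)
c = np.log(t / phi) / phi - np.log(t) - gammaln(1.0 / phi)
exp_family = np.exp((theta * t - b) / phi + c)       # a(phi) = phi

assert abs(exp_family - gamma.pdf(t, a=nu, scale=mu / nu)) < 1e-10
```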
Gamma regression model: The distribution of the response $T$ given the shape parameter $\nu$ and a vector of covariates $z=(1,z_{1},\dots,z_{m})^{T}$ is $\Gamma(\nu,\mu(z))$, and the link function is logarithmic:
$$\log(\mu(z))=\beta^{T}z=\beta_{0}+\beta_{1}z_{1}+\dots+\beta_{m}z_{m}\ \Rightarrow\ \mu(z)=e^{\beta^{T}z}.$$
Thus, the p.d.f., c.d.f., mean, and variance given the vector of covariates are as follows:
$$f(t,\nu,\beta)=\frac{\nu^{\nu}e^{-\nu\beta^{T}z}}{\Gamma(\nu)}\,t^{\nu-1}\exp\{-(\nu/e^{\beta^{T}z})t\},\quad t>0,$$
$$F(t,\nu,\beta)=F_{\chi^{2}_{2\nu}}(2\nu t/e^{\beta^{T}z})=\frac{1}{\Gamma(\nu)}\,\gamma(\nu,\nu t/e^{\beta^{T}z}),\qquad (4)$$
where $\gamma$ is the lower incomplete gamma function, and
$$\mathrm{E}(T|z)=\mu(z)=e^{\beta^{T}z},\qquad \mathrm{Var}(T|z)=\mu^{2}(z)/\nu=e^{2\beta^{T}z}/\nu.$$
Sometimes the canonical (inverse) link function is used:
$$\frac{1}{\mu(z)}=\beta^{T}z=\beta_{0}+\beta_{1}z_{1}+\dots+\beta_{m}z_{m}.$$
Inverse Gaussian regression model: The distribution of the response $T$ given the shape parameter $\nu$ and a vector of covariates $z=(1,z_{1},\dots,z_{m})^{T}$ is $IG(\nu,\mu(z))$, and the link function is logarithmic:
$$\log(\mu(z))=\beta^{T}z=\beta_{0}+\beta_{1}z_{1}+\dots+\beta_{m}z_{m}.$$
Thus, the p.d.f., c.d.f., mean, and variance given the vector of covariates are as follows:
$$f(t,\nu,\beta)=\sqrt{\frac{\nu}{2\pi t^{3}}}\exp\left\{-\frac{\nu(t-e^{\beta^{T}z})^{2}}{2te^{2\beta^{T}z}}\right\},\quad t>0,$$
$$F(t,\nu,\beta)=\Phi\left(\sqrt{\frac{\nu}{t}}\left(\frac{t}{e^{\beta^{T}z}}-1\right)\right)+\exp\left\{\frac{2\nu}{e^{\beta^{T}z}}\right\}\Phi\left(-\sqrt{\frac{\nu}{t}}\left(\frac{t}{e^{\beta^{T}z}}+1\right)\right),\quad t>0,\qquad (5)$$
$$\mathrm{E}(T|z)=\mu(z)=e^{\beta^{T}z},\qquad \mathrm{Var}(T|z)=\mu^{3}(z)/\nu=e^{3\beta^{T}z}/\nu.$$
Sometimes, the canonical (inverse squared) link function is used:
$$\frac{1}{\mu^{2}(z)}=\beta^{T}z=\beta_{0}+\beta_{1}z_{1}+\dots+\beta_{m}z_{m}.$$
The gamma regression model is also an AFT model, whereas the IG regression model is not.
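A minimal sketch (ours, not the authors' code; the coefficient values are illustrative) of simulating responses from the gamma regression model with logarithmic link: given covariates $z$, $T\,|\,z\sim\Gamma(\nu,\mu(z))$ with $\mu(z)=e^{\beta^{T}z}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, nu = 200, 3.0
beta = np.array([1.0, 0.5, -0.02])           # beta_0, beta_1, beta_2 (illustrative)
z1 = (np.arange(n) >= n // 2).astype(float)  # dichotomous covariate
z2 = rng.uniform(20, 30, size=n)             # continuous covariate
Z = np.column_stack([np.ones(n), z1, z2])    # design matrix with intercept

mu = np.exp(Z @ beta)                        # mean response mu(z) = e^{beta^T z}
# numpy's gamma sampler uses (shape, scale); mean = shape * scale = mu
T = rng.gamma(shape=nu, scale=mu / nu)

assert T.shape == (n,) and (T > 0).all()
```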

3. Chi-Squared GOF Tests for Gamma and Inverse Gaussian Regression

3.1. Parameter Estimation

Let us consider possibly right-censored regression data:
$$(X_{1},\delta_{1},z_{1}),\dots,(X_{n},\delta_{n},z_{n}),$$
where $X_{i}=Y_{i}\wedge C_{i}$, $\delta_{i}=\mathbf{1}_{\{Y_{i}\le C_{i}\}}$, the $Y_{i}$ are the responses, and the $C_{i}$ are the censoring times.
Denote by
$$\lambda(x,\theta)=f(x,\theta)/(1-F(x,\theta))=f(x,\theta)/S(x,\theta),\qquad \Lambda(x,\theta)=-\ln S(x,\theta)$$
the hazard and the cumulative hazard functions, respectively, depending on a finite-dimensional parameter $\theta$. In the case of the gamma and inverse Gaussian regression models, $\theta=(\beta^{T},\nu)^{T}$.
The parametric log-likelihood function is
$$\ell(\theta)=\sum_{i=1}^{n}\big(\delta_{i}\ln\lambda_{i}(X_{i},\theta)-\Lambda_{i}(X_{i},\theta)\big),$$
where $\ln\lambda_{i}$ and $\frac{\partial}{\partial\theta}\ln\lambda_{i}(t,\theta)$ are presented in Section 3.2 and Section 3.3.
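Since $\Lambda_{i}=-\ln S_{i}$, the log-likelihood above equals $\sum_{i}\big(\delta_{i}\ln f_{i}(X_{i})+(1-\delta_{i})\ln S_{i}(X_{i})\big)$. A minimal sketch (ours, with hypothetical data; not the authors' code) for the gamma regression with log link:

```python
# Right-censored log-likelihood for gamma regression with logarithmic link.
import numpy as np
from scipy.stats import gamma

def loglik_gamma(params, X, delta, Z):
    """params = (beta_0, ..., beta_m, nu); X: observed times;
    delta: censoring indicators; Z: n x (m+1) design matrix with intercept."""
    beta, nu = params[:-1], params[-1]
    mu = np.exp(Z @ beta)                        # logarithmic link
    logf = gamma.logpdf(X, a=nu, scale=mu / nu)  # density contribution
    logS = gamma.logsf(X, a=nu, scale=mu / nu)   # survival contribution
    return np.sum(delta * logf + (1 - delta) * logS)

# Tiny illustrative (hypothetical) data set.
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
X = np.array([1.2, 0.7, 2.5])
delta = np.array([1, 0, 1])
val = loglik_gamma(np.array([0.3, -0.2, 2.0]), X, delta, Z)
assert np.isfinite(val)
```

The ML estimator $\hat\theta$ could then be obtained by maximizing this function, e.g., with a general-purpose optimizer such as `scipy.optimize.minimize` applied to its negative.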

3.2. Gamma Regression

In the case of gamma regression, the following results were obtained:
$$\ln\lambda_{i}(X_{i},\theta)=\nu\ln\nu+(\nu-1)\ln X_{i}-\frac{\nu}{\mu_{i}}X_{i}-\nu\ln\mu_{i}-\ln\Gamma(\nu)-\ln\big(1-F_{i}(X_{i},\nu,\mu_{i})\big),$$
$$\mu_{i}=\mu_{i}(z_{i},\beta)=\begin{cases}e^{\beta^{T}z_{i}}&\text{if the link is logarithmic},\\ 1/(\beta^{T}z_{i})&\text{if the link is inverse},\end{cases}\qquad (6)$$
$$\frac{\partial}{\partial\nu}\ln\lambda_{i}(X_{i},\theta)=1+\ln\nu+\ln X_{i}-X_{i}/\mu_{i}-\ln\mu_{i}-\psi(\nu)+\frac{\partial F_{i}(X_{i},\nu,\mu_{i})/\partial\nu}{1-F_{i}(X_{i},\nu,\mu_{i})},\qquad (7)$$
where $F_{i}$ is the c.d.f. of the $i$th response and $\psi(\nu)$ is the digamma function.
Note that the derivative of the probability density function of the $i$th response with respect to $\nu$ is
$$\frac{\partial f_{i}(t,\nu,\beta)}{\partial\nu}=\frac{\nu^{\nu}}{\mu_{i}^{\nu}\Gamma(\nu)}\,t^{\nu-1}e^{-\nu t/\mu_{i}}\big(1+\ln\nu-\ln\mu_{i}-\psi(\nu)+\ln t-t/\mu_{i}\big),\quad t>0.$$
This implies that the derivative of the c.d.f. $F_{i}$ with respect to $\nu$ is the integral of the derivative of the p.d.f.:
$$\frac{\partial F_{i}(X_{i},\nu,\mu_{i})}{\partial\nu}=\frac{\nu^{\nu}}{\mu_{i}^{\nu}\Gamma(\nu)}\left[(1+\ln\nu-\ln\mu_{i}-\psi(\nu))\int_{0}^{X_{i}}u^{\nu-1}e^{-\nu u/\mu_{i}}\,du+\int_{0}^{X_{i}}(\ln u-u/\mu_{i})\,u^{\nu-1}e^{-\nu u/\mu_{i}}\,du\right]$$
$$=\frac{1}{\Gamma(\nu)}\gamma(\nu,\nu X_{i}/\mu_{i})\big(1+\ln\nu-\ln\mu_{i}-\psi(\nu)\big)+\frac{1}{\Gamma(\nu)}\big[\gamma_{1}(\nu,\nu X_{i}/\mu_{i})+\ln(\mu_{i}/\nu)\,\gamma(\nu,\nu X_{i}/\mu_{i})\big]-\frac{1}{\nu\Gamma(\nu)}\gamma(\nu+1,\nu X_{i}/\mu_{i}),$$
where $\gamma_{1}$ is the derivative of the lower incomplete gamma function with respect to its first argument (it can be computed with the function pgamma.deriv.unscaled of the R (version 4.4.3) package VGAM); the function gammainc of the R package pracma computes the lower incomplete gamma function, and the functions gamma and digamma of the base package return the values of the gamma and digamma functions, respectively.
$$\frac{\partial}{\partial\beta}\ln\lambda_{i}(X_{i},\theta)=\frac{\nu}{\mu_{i}}\left(\frac{X_{i}}{\mu_{i}}-1\right)\frac{\partial\mu_{i}}{\partial\beta}+\frac{\partial F_{i}(X_{i},\nu,\mu_{i})/\partial\beta}{1-F_{i}(X_{i},\nu,\mu_{i})},\qquad (9)$$
$$\frac{\partial F_{i}(X_{i},\nu,\mu_{i})}{\partial\beta}=\left[\frac{1}{\mu_{i}\Gamma(\nu)}\gamma(\nu+1,\nu X_{i}/\mu_{i})-\frac{\nu}{\mu_{i}}\frac{1}{\Gamma(\nu)}\gamma(\nu,\nu X_{i}/\mu_{i})\right]\frac{\partial\mu_{i}}{\partial\beta},$$
where $\mu_{i}$ is defined by (6).
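The expression for $\partial F_{i}/\partial\beta$ relies on the lower-incomplete-gamma recurrence $\gamma(\nu+1,x)=\nu\,\gamma(\nu,x)-x^{\nu}e^{-x}$, which can be checked numerically (a sketch, ours; `scipy.special.gammainc` is the regularized function $\gamma(s,x)/\Gamma(s)$).

```python
# Numerical check of gamma(nu + 1, x) = nu * gamma(nu, x) - x^nu * e^{-x}.
import numpy as np
from scipy.special import gammainc, gamma as gamma_fn

nu, x = 2.7, 1.9
lower_inc = lambda s, x: gamma_fn(s) * gammainc(s, x)  # unregularized gamma(s, x)
lhs = lower_inc(nu + 1, x)
rhs = nu * lower_inc(nu, x) - x ** nu * np.exp(-x)
assert abs(lhs - rhs) < 1e-12
```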

3.3. Inverse Gaussian Regression

In the case of inverse Gaussian regression, the following was obtained:
$$\ln\lambda_{i}(X_{i},\theta)=\frac{1}{2}\big(\ln\nu-\ln(2\pi X_{i}^{3})\big)-\frac{\nu(X_{i}-\mu_{i})^{2}}{2\mu_{i}^{2}X_{i}}-\ln\left[1-\Phi\left(\sqrt{\tfrac{\nu}{X_{i}}}\Big(\tfrac{X_{i}}{\mu_{i}}-1\Big)\right)-e^{2\nu/\mu_{i}}\,\Phi\left(-\sqrt{\tfrac{\nu}{X_{i}}}\Big(\tfrac{X_{i}}{\mu_{i}}+1\Big)\right)\right],$$
$$\mu_{i}=\mu_{i}(z_{i},\beta)=\begin{cases}e^{\beta^{T}z_{i}}&\text{if the link is logarithmic},\\ 1/\sqrt{\beta^{T}z_{i}}&\text{if the link is inverse squared}.\end{cases}\qquad (10)$$
$$\frac{\partial}{\partial\nu}\ln\lambda_{i}(X_{i},\theta)=\frac{1}{2\nu}-\frac{(X_{i}-\mu_{i})^{2}}{2\mu_{i}^{2}X_{i}}+\big[1-\Phi(a_{i})-e^{2\nu/\mu_{i}}\Phi(-b_{i})\big]^{-1}\times\left[\frac{a_{i}\varphi(a_{i})}{2\nu}+e^{2\nu/\mu_{i}}\left(\frac{2\Phi(-b_{i})}{\mu_{i}}-\frac{b_{i}\varphi(b_{i})}{2\nu}\right)\right];\qquad (11)$$
$$\frac{\partial}{\partial\beta}\ln\lambda_{i}(X_{i},\theta)=\frac{1}{\mu_{i}^{2}}\left[\frac{\nu(X_{i}-\mu_{i})}{\mu_{i}}-\big[1-\Phi(a_{i})-e^{2\nu/\mu_{i}}\Phi(-b_{i})\big]^{-1}\left(\varphi(a_{i})\sqrt{\nu X_{i}}+e^{2\nu/\mu_{i}}\big(2\nu\,\Phi(-b_{i})-\varphi(b_{i})\sqrt{\nu X_{i}}\big)\right)\right]\frac{\partial\mu_{i}}{\partial\beta},\qquad (12)$$
where $\varphi$ is the p.d.f. of the standard normal distribution, $\mu_{i}$ is defined by (10), and
$$a_{i}=\sqrt{\frac{\nu}{X_{i}}}\left(\frac{X_{i}}{\mu_{i}}-1\right),\qquad b_{i}=\sqrt{\frac{\nu}{X_{i}}}\left(\frac{X_{i}}{\mu_{i}}+1\right).$$

3.4. Grouping Intervals

Let $\hat\theta$ be the ML estimator of $\theta$. Set
$$N_{i}(t)=\mathbf{1}_{\{X_{i}\le t,\ \delta_{i}=1\}},\qquad N(t)=\sum_{i=1}^{n}N_{i}(t),$$
where $N(t)$ is the number of observed responses in the interval $[0,t]$. The mean of $N(t)$ is
$$\mathrm{E}\,N(t)=\mathrm{E}\sum_{i=1}^{n}\Lambda_{i}(t\wedge X_{i},\theta).$$
So $\sum_{i=1}^{n}\Lambda_{i}(t\wedge X_{i},\hat\theta)$ may be interpreted as the expected number of responses in the interval $[0,t]$ when the parametric model is true.
If the parametric model is true, then the difference $N(t)-\sum_{i=1}^{n}\Lambda_{i}(t\wedge X_{i},\hat\theta)$ should take smaller values than in the case when the model is false.
Denote by $X_{(1)}\le\dots\le X_{(n)}$ the ordered $X_{1},\dots,X_{n}$. Set $\Lambda_{(i)}(t,\theta)=\Lambda(t,z_{((i))},\theta)$; here, $z_{((i))}$ is the vector of covariates corresponding to $X_{(i)}$ in the sample. Define
$$E_{k}=\sum_{i=1}^{n}\Lambda_{i}(X_{i},\hat\theta)=\sum_{i=1}^{n}\Lambda_{(i)}(X_{(i)},\hat\theta),\qquad E_{j}=\frac{j}{k}E_{k},\quad j=1,\dots,k,$$
where $E_{k}$ is the estimator (under the true model) of the expected number of responses in the interval $[0,X_{(n)}]$. If the model is true, then this value should not be far from $n$.
Divide the interval $[0,X_{(n)}]$ into $k$ smaller intervals $I_{j}=(a_{j-1},a_{j}]$, with $a_{0}=0$ and $a_{k}=X_{(n)}$ (do not identify these $a_{j}$ with the $a_{i}$ from Section 3.3), chosen so that each interval has the same expected number of responses. More precisely, the point $a_{j}$ is defined in the following way:
$$g(a_{j})=E_{j},\qquad g(a)=\sum_{l=1}^{n}\Lambda_{(l)}(a\wedge X_{(l)},\hat\theta).$$
The function $g(a)$ is strictly increasing, $g(0)=0$, and $g(X_{(n)})=E_{k}$.
Set $X_{(0)}=0$ and use the convention $\sum_{l=1}^{0}c_{l}=0$. Define
$$b_{i}=g(X_{(i)})=\sum_{l=1}^{n}\Lambda_{(l)}(X_{(i)}\wedge X_{(l)},\hat\theta)=\sum_{l=i+1}^{n}\Lambda_{(l)}(X_{(i)},\hat\theta)+\sum_{l=1}^{i}\Lambda_{(l)}(X_{(l)},\hat\theta).$$
Note that $b_{0}=0$, $b_{n}=E_{k}$, and $E_{j}\in(0,E_{k}]$, $j=1,\dots,k$. Hence, there exists $i_{j}$ such that $E_{j}\in(b_{i_{j}-1},b_{i_{j}}]$, which implies that $a_{j}\in(X_{(i_{j}-1)},X_{(i_{j})}]$. So, at first, $i_{j}$ is found. Then $a_{j}$ is obtained as the unique root of the function $h_{j}(a)=g(a)-E_{j}$ in the interval $(X_{(i_{j}-1)},X_{(i_{j})}]$; it is easily found by the bisection method because $h_{j}(X_{(i_{j}-1)})<0$, $h_{j}(X_{(i_{j})})\ge 0$, and $h_{j}(a)$ is strictly increasing. Note that in the interval $(X_{(i_{j}-1)},X_{(i_{j})}]$ the function $g(a)$ may be written as follows:
$$g(a)=\sum_{l=1}^{n}\Lambda_{(l)}(a\wedge X_{(l)},\hat\theta)=\sum_{l=i_{j}}^{n}\Lambda_{(l)}(a,\hat\theta)+\sum_{l=1}^{i_{j}-1}\Lambda_{(l)}(X_{(l)},\hat\theta).$$
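The construction above can be sketched as follows (ours, not the authors' code). The function `cum_haz(t, l)` is a stand-in for $\Lambda_{(l)}(t,\hat\theta)$; for simplicity, the bisection runs over the whole interval $[0,X_{(n)}]$ rather than the bracket $(X_{(i_{j}-1)},X_{(i_{j})}]$, which is valid because $g$ is strictly increasing.

```python
import numpy as np

def interval_endpoints(X_sorted, cum_haz, k):
    """Solve g(a_j) = E_j = j * E_k / k by bisection, where
    g(a) = sum_l Lambda_(l)(min(a, X_(l)))."""
    n = len(X_sorted)
    g = lambda a: sum(cum_haz(min(a, X_sorted[l]), l) for l in range(n))
    Ek = g(X_sorted[-1])
    a = [0.0]
    for j in range(1, k):
        Ej = j * Ek / k
        lo, hi = 0.0, X_sorted[-1]
        for _ in range(80):          # bisection: g is strictly increasing
            mid = 0.5 * (lo + hi)
            if g(mid) < Ej:
                lo = mid
            else:
                hi = mid
        a.append(0.5 * (lo + hi))
    a.append(X_sorted[-1])           # a_0 = 0, a_k = X_(n)
    return np.array(a)

# Example with a unit exponential cumulative hazard Lambda(t) = t for every subject:
X = np.sort(np.random.default_rng(1).exponential(size=50))
a = interval_endpoints(X, lambda t, l: t, k=7)
assert np.all(np.diff(a) > 0) and a[0] == 0.0 and a[-1] == X[-1]
```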

3.5. Test Statistic

The numbers of observed and expected responses in the interval $I_{j}=(a_{j-1},a_{j}]$ are, respectively,
$$U_{j}=\sum_{i:X_{i}\in I_{j}}\delta_{i},\qquad e=E_{k}/k,\quad j=1,\dots,k.$$
The chi-squared test is based on the random vector
$$Z=(Z_{1},\dots,Z_{k})^{T},\qquad Z_{j}=\frac{1}{\sqrt{n}}(U_{j}-e),$$
i.e., on the differences between the observed and the expected (under the GLM) numbers of responses in the intervals $I_{j}$.
Set
$$\hat A_{j}=U_{j}/n,\qquad \hat C_{j}=\frac{1}{n}\sum_{i:X_{i}\in I_{j}}\delta_{i}\,\frac{\partial}{\partial\theta}\ln\lambda_{i}(X_{i},\hat\theta),\qquad \hat C=(\hat C_{1},\dots,\hat C_{k}).$$
If $s$ is the dimension of $\theta$, then the $\hat C_{j}$ are $s\times 1$ vectors and $\hat C$ is an $s\times k$ matrix. Denote by $\hat A$ the diagonal matrix with diagonal elements $\hat A_{j}$.
The limit distribution of the random vector $Z$ is found by applying the results of Theorems 3.1 and 3.2 of our article [13] (these theorems are also provided in [18] and in Appendix A of this article). Note that these theorems can be applied not only to AFT models but also to GLMs, because various forms of the parametric hazard functions $\lambda_{i}(u,\theta)$ may be chosen for different $i$ (see Appendix A).
The proofs of the above-mentioned theorems in [13] proceed in the following steps. At first, the asymptotic properties of the stochastic process
$$H_{n}(t)=\frac{1}{\sqrt{n}}\left(N(t)-\sum_{i=1}^{n}\int_{0}^{t}\lambda_{i}(u,\hat\theta)Y_{i}(u)\,du\right)$$
were investigated ([13], Lemma 3.1; Appendix A, Lemma A1) by applying the central limit theorem (CLT) for martingales under well-known assumptions (see [19]) on the asymptotic properties (consistency and asymptotic normality) of the ML estimator $\hat\theta$ and the assumptions of the CLT. Lemma 3.1 implies the limit distribution of the random vector $Z$ (see [13], Theorem 3.1; Appendix A, Theorem A1). This distribution is approximated by the normal distribution $N_{k}(0,V)$. Theorem 3.2 (see Appendix A, Theorem A2) implies that the covariance matrix $V$ is consistently estimated by the matrix
$$\hat V=\hat A-\hat C^{T}\hat i^{-1}\hat C,$$
where $\hat i$ is an $s\times s$ matrix (see [18]) that can be written in the following form:
$$\hat i=\frac{1}{n}\sum_{i=1}^{n}\delta_{i}\,\frac{\partial}{\partial\theta}\ln\lambda_{i}(X_{i},\hat\theta)\left(\frac{\partial}{\partial\theta}\ln\lambda_{i}(X_{i},\hat\theta)\right)^{T},$$
where the derivatives of $\ln\lambda_{i}$ are provided in (7) and (9) for the gamma regression, and in (11) and (12) for the inverse Gaussian regression.
The chi-squared test of the hypothesis $H_{0}$ is based on the following statistic:
$$Y^{2}=Z^{T}\hat V^{-}Z,$$
where $\hat V^{-}$ is a generalized inverse of the matrix $\hat V$. The hypothesis is rejected with an approximate significance level of $\alpha$ if $Y^{2}>\chi^{2}_{\alpha}(r)$, where $r$ is the rank of the matrix $V$.
Note that in the case of the gamma regression, $V$ is a full-rank matrix ($r=k$); thus, $\hat V^{-}=\hat V^{-1}$. In the case of the inverse Gaussian regression model, $\mathrm{rank}(V)=k-1$.
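A sketch of the final decision rule (ours), assuming $Z$, $\hat A$, $\hat C$, and $\hat i$ have already been computed; the Moore–Penrose pseudoinverse serves as the generalized inverse. The numbers below are illustrative, not from the article.

```python
import numpy as np
from scipy.stats import chi2

def modified_chi2_test(Z, A_hat, C_hat, i_hat, alpha=0.05):
    """Z: k-vector; A_hat: k x k diagonal; C_hat: s x k; i_hat: s x s.
    Returns the statistic Y^2 and the rejection decision."""
    V_hat = A_hat - C_hat.T @ np.linalg.inv(i_hat) @ C_hat
    Y2 = Z @ np.linalg.pinv(V_hat) @ Z       # generalized inverse of V_hat
    r = np.linalg.matrix_rank(V_hat)         # degrees of freedom
    return Y2, Y2 > chi2.ppf(1 - alpha, df=r)

# Illustrative inputs:
rng = np.random.default_rng(2)
k, s = 6, 3
Z = rng.normal(size=k) / 10
A_hat = np.diag(np.full(k, 1.0 / k))
C_hat = rng.normal(size=(s, k)) / 20
i_hat = np.eye(s)
Y2, reject = modified_chi2_test(Z, A_hat, C_hat, i_hat)
assert Y2 >= 0
```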

4. Simulation Study

The data are simulated by taking two covariates: $z_{1}$, dichotomous (0 for half of the observations and 1 for the remaining observations), and $z_{2}\sim U(20,30)$. Different sample sizes $n$ are considered. The Rice rule (see [20]) is used to determine the number of grouping intervals (see Table 1):
$$k=[2n^{1/3}].\qquad (15)$$
In the assumptions behind the limit distribution of the test statistic, it is supposed that $k$ is fixed, and the limit distribution is chi-squared with $k$ or $k-1$ degrees of freedom. Is the approximation accurate if $k=[2n^{1/3}]$? Note that if $n$ is fixed, then $k=[2n^{1/3}]$ is also fixed. We know that if a sample size $n^{*}>k$ is sufficiently large and $k=[2n^{1/3}]$, then the chi-squared approximation is accurate. Taking into account that $n$ is much larger than $k$ (see Table 1), the approximation should also be good for sample size $n$. Simulations confirm this.
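The Rice rule as used here takes the integer part of $2n^{1/3}$ and reproduces the values in Table 1:

```python
# Rice rule: k = integer part of 2 * n^(1/3).
from math import floor

rice_k = lambda n: floor(2 * n ** (1 / 3))

assert [rice_k(n) for n in (30, 50, 100, 200, 300, 2000)] == [6, 7, 9, 11, 13, 25]
```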

4.1. Simulation Under Hypotheses

The estimated significance levels are obtained using 5000 iterations. Tests with significance levels α = 0.05 and α = 0.1 are applied. Table 2 and Table 3 present the results for the inverse Gaussian and gamma regression, respectively. The grouping intervals are computed using the Rice rule (15); moreover, different numbers of grouping intervals are considered to see how the convergence speed depends on the number of grouping intervals. The simulation results under the hypothesis demonstrate that the estimated significance levels approach the nominal value as the number of observations increases.

4.2. Simulation Under Alternatives

The data are simulated under various alternatives and values of parameters. For each of the sample sizes considered, we simulate 1000 replications and compute values of the test power. The significance level is 0.05.
In the case of inverse Gaussian regression, the test power under the following alternatives is investigated (see Table 4): gamma regression and the log-normal, log-logistic, and Weibull AFT models. For gamma regression, the following alternatives are considered: inverse Gaussian and normal regression; the log-normal, log-logistic, and Weibull AFT models; and gamma regression models with shape and scale depending on covariates.
The results in the case of gamma regression are presented in Table 5. It has become evident that the test power under the IG regression alternative is large even for small sample sizes. The smallest test power values are in the case of the Weibull AFT model alternative, which is reasonable because gamma and Weibull models are very similar for some sets of parameters.
The results in the case of IG regression are presented in Table 6. It turned out that the test power under all considered alternatives is large even for small sample sizes. The smallest test power values are obtained when the alternative is the log-logistic AFT model.
Moreover, the simulation study suggested that in the case of the gamma and inverse Gaussian regression, the Rice rule (15) provides the optimal number of grouping intervals ($k_{opt}=k_{Rice}$) for sample sizes $n\ge 60$; for smaller samples, the number of grouping intervals is $k_{opt}=k_{Rice}-1$.

5. Real Data Examples

Example 1: Failure times (see Table 7) of 76 electrical insulating fluids tested at voltages ranging from 26 to 38 kV ([21]) are considered.
The diagnostic methods (see [2]) suggest that the Weibull AFT–power rule model, i.e., the Weibull AFT model with the covariate log(v_i) (the log-voltage), should be used. The results of applying the modified chi-squared test are presented in Table 8. The analysis demonstrated that the Weibull AFT–power rule and gamma regression models are not rejected; however, the AIC and BIC are smaller in the case of the gamma regression model. The inverse Gaussian regression model is strongly rejected.
Example 2: Hospital cost data (the dataset hospcosts from the R package robmixglm) consist of a sample of 100 patients hospitalized at the Centre Hospitalier Universitaire Vaudois in Lausanne during 1999 for “medical back problems”. The response is the cost of stay, and the covariates are as follows: length of stay (in days; the logarithmic transformation was applied), admission type (0: planned; 1: emergency), insurance type (0: regular; 1: private), age (in years), sex (0: female; 1: male), and discharge destination (1: home; 0: another health institution). The data were analyzed in [8] considering the gamma regression and in [22] in the Weibull model context.
The results of applying the modified chi-squared test are presented in Table 9. It is clear that the Weibull AFT–power rule and gamma regression models are not rejected. However, AIC is smaller in the case of the Weibull AFT–power rule model. The inverse Gaussian regression model is strongly rejected.
Example 3: Table 10 presents the results of an experiment designed to compare the performances of high-speed turbine engine bearings made out of five different compounds (see [2]). Data were fitted using a three-parameter Weibull distribution. The experiment tested 10 bearings of each type, and the times to fatigue failure were measured in units of millions of cycles.
The results using the modified chi-squared test are presented in Table 11. The gamma, Weibull AFT, and inverse Gaussian regression models are rejected. The results do not contradict the results in [2].

6. Conclusions

Modified chi-squared goodness-of-fit tests were constructed for gamma and inverse Gaussian regression models with possibly censored data. A methodology for choosing the grouping intervals was proposed, and practical recommendations based on the simulation results were presented. The results indicated that the test power under the various considered alternatives is large even for small sample sizes. Moreover, in the case of the gamma and inverse Gaussian regression, the Rice rule (15) provides the optimal number of grouping intervals ($k_{opt}=k_{Rice}$) for sample sizes $n\ge 60$; for smaller samples, the number of grouping intervals is $k_{opt}=k_{Rice}-1$. The application of the tests was shown using real data. The proposed tests are important in the data modeling process. They are sensitive to misspecification of the model structure: if the model is misspecified, the “expected” number of responses will be far from the observed number, the test statistic will take large values, and the hypothesis will be rejected; another model structure can then be taken into consideration. The article fills the gap in formal omnibus tests for gamma and inverse Gaussian regression.

Author Contributions

Conceptualization, V.B. and R.L.; methodology, V.B. and R.L.; investigation, V.B. and R.L.; writing—original draft preparation, V.B. and R.L.; writing—review and editing, V.B. and R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In [13], the following results were obtained.
Condition A (consistency and asymptotic normality of the ML estimator $\hat\theta$):
$$\hat\theta\ \xrightarrow{P}\ \theta_{0},\qquad \frac{1}{\sqrt{n}}\dot\ell(\theta_{0})\ \xrightarrow{d}\ N_{m}(0,i(\theta_{0})),\qquad -\frac{1}{n}\ddot\ell(\theta_{0})\ \xrightarrow{P}\ i(\theta_{0});$$
$$\sqrt{n}\,(\hat\theta-\theta_{0})=i^{-1}(\theta_{0})\frac{1}{\sqrt{n}}\dot\ell(\theta_{0})+o_{P}(1),$$
where
$$\dot\ell(\theta)=\sum_{i=1}^{n}\int_{0}^{\tau}\frac{\partial}{\partial\theta}\ln\lambda_{i}(u,\theta)\,\{dN_{i}(u)-Y_{i}(u)\lambda_{i}(u,\theta)\,du\}.$$
Set
$$S^{(0)}(t,\theta)=\sum_{i=1}^{n}Y_{i}(t)\lambda_{i}(t,\theta),\qquad S^{(1)}(t,\theta)=\sum_{i=1}^{n}Y_{i}(t)\frac{\partial\ln\lambda_{i}(t,\theta)}{\partial\theta}\,\lambda_{i}(t,\theta),$$
$$S^{(2)}(t,\theta)=\sum_{i=1}^{n}Y_{i}(t)\frac{\partial^{2}\ln\lambda_{i}(t,\theta)}{\partial\theta^{2}}\,\lambda_{i}(t,\theta).$$
Condition B: There exist a neighborhood $\Theta_{0}$ of $\theta_{0}$ and functions
$$s^{(0)}(t,\theta),\qquad s^{(1)}(t,\theta)=\frac{\partial s^{(0)}(t,\theta)}{\partial\theta},\qquad s^{(2)}(t,\theta)=\frac{\partial^{2}s^{(0)}(t,\theta)}{\partial\theta^{2}},$$
continuous and bounded on $\Theta_{0}\times[0,\tau]$, such that for $j=0,1,2$,
$$\sup_{t\in[0,\tau],\,\theta\in\Theta_{0}}\left\|\frac{1}{n}S^{(j)}(t,\theta)-s^{(j)}(t,\theta)\right\|\ \xrightarrow{P}\ 0\quad\text{as } n\to\infty.$$
Condition B implies that, uniformly for $t\in[0,\tau]$,
$$\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{t}\lambda_{i}(u,\theta_{0})Y_{i}(u)\,du\ \xrightarrow{P}\ A(t),\qquad \frac{1}{n}\sum_{i=1}^{n}\int_{0}^{t}\dot\lambda_{i}(u,\theta_{0})Y_{i}(u)\,du\ \xrightarrow{P}\ C(t),$$
where $A$ and $C$ are finite.
Lemma A1.
Under Conditions A and B, the following convergence holds:
$$H_{n}\ \xrightarrow{d}\ V\quad\text{on } D[0,\tau],$$
where $D[0,\tau]$ is the space of càdlàg functions with the Skorokhod metric and $V$ is a zero-mean Gaussian martingale such that for all $0\le s\le t$,
$$\mathrm{cov}(V(s),V(t))=A(s)-C^{T}(s)\,i^{-1}(\theta_{0})\,C(t).$$
Theorem A1.
Under Conditions A and B,
$$Z\ \xrightarrow{d}\ Y\sim N_{k}(0,V)\quad\text{as } n\to\infty,$$
where
$$V=A-C^{T}i^{-1}(\theta_{0})\,C.$$
Theorem A2.
Under Conditions A and B, the following estimators of $A_{j}$, $C_{j}$, $i(\theta_{0})$, and $V$ are consistent:
$$\hat A_{j}=U_{j}/n,\qquad \hat C_{j}=\frac{1}{n}\sum_{i=1}^{n}\int_{I_{j}}\frac{\partial}{\partial\theta}\ln\lambda_{i}(u,\hat\theta)\,dN_{i}(u),$$
and
$$\hat i=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{\tau}\frac{\partial\ln\lambda_{i}(u,\hat\theta)}{\partial\theta}\left(\frac{\partial\ln\lambda_{i}(u,\hat\theta)}{\partial\theta}\right)^{T}dN_{i}(u),\qquad \hat V=\hat A-\hat C^{T}\hat i^{-1}\hat C.$$

References

  1. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall: London, UK, 1989. [Google Scholar] [CrossRef]
  2. Lawless, J.F. Statistical Models and Methods for Lifetime Data; John Wiley & Sons: Hoboken, NJ, USA, 2003. [Google Scholar] [CrossRef]
  3. Meeker, W.Q.; Escobar, L.A. Statistical Methods for Reliability Data; John Wiley & Sons: Hoboken, NJ, USA, 1998; ISBN 978-1-118-62597-2. [Google Scholar]
  4. De Jong, P.; Heller, G.Z. Generalized Linear Models for Insurance Data; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar] [CrossRef]
  5. Haberman, S.; Renshaw, A.E. Generalized linear models and actuarial science. J. R. Stat. Soc. Ser. D (The Statistician) 1996, 45, 407–436. [Google Scholar] [CrossRef]
  6. Baione, F.; Biancalana, D. An individual risk model for premium calculation based on quantile: A comparison between generalized linear models and quantile regression. N. Am. Actuar. J. 2019, 23, 573–590. [Google Scholar] [CrossRef]
  7. Blough, D.K.; Ramsey, S.D. Using generalized linear models to assess medical care costs. Health Serv. Outcomes Res. Methodol. 2000, 1, 185–202. [Google Scholar] [CrossRef]
  8. Cantoni, E.; Ronchetti, E. A robust approach for skewed and heavy-tailed outcomes in the analysis of health care expenditures. J. Health Econ. 2006, 25, 198–213. [Google Scholar] [CrossRef]
  9. Ng, V.K.; Cribbie, R.A. Using the gamma generalized linear model for modeling continuous, skewed and heteroscedastic outcomes in psychology. Curr. Psychol. 2017, 36, 225–235. [Google Scholar] [CrossRef]
  10. Klar, B.; Meintanis, S.G. Specification tests for the response distribution in generalized linear models. Comput. Stat. 2012, 27, 251–267. [Google Scholar] [CrossRef]
  11. Shayib, M.A.; Young, D.H. Modified goodness of fit tests in gamma regression. J. Stat. Comput. Simul. 1989, 33, 125–133. [Google Scholar] [CrossRef]
  12. Desmond, A.F.; Yang, Z. Asymptotically refined score and GOF tests for inverse Gaussian models. J. Stat. Comput. Simul. 2016, 86, 3243–3269. [Google Scholar] [CrossRef]
  13. Bagdonavičius, V.B.; Levuliene, R.J.; Nikulin, M.S. Chi-squared goodness-of-fit tests for parametric accelerated failure time models. Commun. Stat.-Theory Methods 2013, 42, 2768–2785. [Google Scholar] [CrossRef]
  14. Bagdonavičius, V.; Nikulin, M.S. Accelerated Life Models: Modeling and Statistical Analysis; Chapman & Hall/CRC: New York, NY, USA, 2002. [Google Scholar] [CrossRef]
  15. Amin, M.; Ullah, M.A.; Qasim, M. Diagnostic techniques for the inverse Gaussian regression model. Commun. Stat.-Theory Methods 2022, 51, 2552–2564. [Google Scholar] [CrossRef]
  16. Amin, M.; Amanullah, M.; Cordeiro, G.M. Influence diagnostics in the gamma regression model with adjusted deviance residuals. Commun. Stat.-Simul. Comput. 2017, 46, 6959–6973. [Google Scholar] [CrossRef]
  17. Imran, M.; Akbar, A. Diagnostics via partial residual plots in inverse Gaussian regression. J. Chemom. 2020, 34, e3203. [Google Scholar] [CrossRef]
  18. Bagdonavicius, V.; Kruopis, J.; Nikulin, M.S. Non-parametric Tests for Censored Data; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar] [CrossRef]
  19. Andersen, P.K.; Borgan, O.; Gill, R.D.; Keiding, N. Statistical Models Based on Counting Processes; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1993. [Google Scholar] [CrossRef]
  20. De La Rubia, J.M. Rice University rule to determine the number of bins. Open J. Stat. 2024, 14, 119–149. [Google Scholar] [CrossRef]
  21. Nelson, W. Hazard plotting methods for analysis of life data with different failure modes. J. Qual. Technol. 1970, 2, 126–149. [Google Scholar] [CrossRef]
  22. Marazzi, A.; Yohai, V.J. Adaptively truncated maximum likelihood regression with asymmetric errors. J. Stat. Plan. Inference 2004, 122, 271–291. [Google Scholar] [CrossRef]
Table 1. The number of grouping intervals using the Rice rule.

n |  30 |  50 |  60 |  70 |  80 | 100 | 150 | 200
k |   6 |   7 |   7 |   8 |   8 |   9 |  10 |  11

n | 300 | 400 | 500 | 600 | 800 | 1000 | 1500 | 2000
k |  13 |  14 |  15 |  16 |  18 |   20 |   22 |   25
Table 2. Estimates of the significance level α under the hypothesis; inverse Gaussian regression with log link, β0 = 5, β1 = 2, β2 = 0.1, ν = 3.

            n = 200   n = 500   n = 1000   n = 1500   n = 2000   nominal α
k = k_Rice  (k = 11)  (k = 15)  (k = 20)   (k = 22)   (k = 25)
            0.1150    0.0860    0.0670     0.0678     0.0584     0.05
            0.1638    0.1417    0.1204     0.1180     0.1098     0.10
k = 11                0.0735    0.0638     0.0620     0.0520     0.05
                      0.1290    0.1139     0.1138     0.1060     0.10
k = 15                          0.0684     0.0636     0.0600     0.05
                                0.1280     0.1166     0.1190     0.10
k = 20                                     0.0656     0.0670     0.05
                                           0.1294     0.1180     0.10
Table 3. Estimates of the significance level α under the hypothesis, gamma regression with log link, β_0 = 7, β_1 = 4, β_2 = 0.3, ν = 0.45.

n          | 200    | 500    | 1000   | 1500   | 2000   | +∞
k = k_Rice | 11     | 15     | 20     | 22     | 25     |
           | 0.0910 | 0.0722 | 0.0692 | 0.0672 | 0.0720 | 0.05
           | 0.1510 | 0.1362 | 0.1304 | 0.1264 | 0.1302 | 0.10
k = 11     |        | 0.0788 | 0.0634 | 0.0556 | 0.0562 | 0.05
           |        | 0.1418 | 0.1206 | 0.1098 | 0.1064 | 0.10
k = 15     |        |        | 0.0726 | 0.0602 | 0.0610 | 0.05
           |        |        | 0.1278 | 0.1176 | 0.1128 | 0.10
k = 20     |        |        |        | 0.0736 | 0.0620 | 0.05
           |        |        |        | 0.1264 | 0.1198 | 0.10
Table 4. Definitions of alternative models.

Model | CDF
gamma regression with log link | (4)
IG regression with log link | (5)
Weibull AFT–log-linear | 1 − exp(−(t / e^{β^T z})^ν)
log-logistic AFT–log-linear | 1 − (1 + (t / e^{β^T z})^ν)^{−1}
log-normal AFT–log-linear | Φ((ln t − β^T z) / σ), σ = 1/ν
gamma regression with shape and scale depending on covariates | (4) with ν = ν_0 + ν_1 z_1 + … + ν_j z_j
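For reference, the AFT alternatives can be evaluated directly under their standard parameterizations: the Weibull CDF 1 − exp(−(t/e^{β^T z})^ν), the log-logistic CDF 1 − (1 + (t/e^{β^T z})^ν)^{−1}, and the log-normal CDF Φ(ν(ln t − β^T z)). The sketch below uses only the Python standard library; the parameter values are illustrative, not taken from the paper. At t = e^{β^T z} the Weibull CDF equals 1 − e^{−1} while the log-logistic and log-normal CDFs equal 1/2, a convenient sanity check:

```python
import math

def weibull_aft_cdf(t, bz, nu):
    """Weibull AFT: F(t) = 1 - exp(-(t / e^{b'z})^nu)."""
    return 1.0 - math.exp(-((t / math.exp(bz)) ** nu))

def loglogistic_aft_cdf(t, bz, nu):
    """Log-logistic AFT: F(t) = 1 - (1 + (t / e^{b'z})^nu)^(-1)."""
    return 1.0 - 1.0 / (1.0 + (t / math.exp(bz)) ** nu)

def lognormal_aft_cdf(t, bz, nu):
    """Log-normal AFT: F(t) = Phi(nu * (ln t - b'z)), sigma = 1/nu."""
    return 0.5 * (1.0 + math.erf(nu * (math.log(t) - bz) / math.sqrt(2.0)))

# Sanity check at t = e^{b'z}; bz and nu are illustrative values
bz, nu = 1.2, 2.0
t = math.exp(bz)
print(weibull_aft_cdf(t, bz, nu))      # 1 - e^{-1} ≈ 0.632
print(loglogistic_aft_cdf(t, bz, nu))  # 0.5
print(lognormal_aft_cdf(t, bz, nu))    # Phi(0) = 0.5
```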
Table 5. Gamma regression. Powers against various alternatives. n: number of observations; k: optimal number of grouping intervals.

Alternative | 30; 5 | 50; 6 | 60; 7 | 80; 8 | 100; 9 | 150; 10 | 200; 11 | 250; 12 | 300; 13
β_0 = 1, β_1 = 1, β_2 = 0.01, ν = 3:
IG with log link | 0.706 | 0.902 | 0.943 | 0.987 | 1 | 1 | 1 | 1 | 1
Weibull AFT–log-linear | 0.282 | 0.317 | 0.332 | 0.346 | 0.360 | 0.427 | 0.493 | 0.585 | 0.674
log-logistic AFT–log-linear | 0.542 | 0.569 | 0.571 | 0.646 | 0.682 | 0.742 | 0.837 | 0.897 | 0.923
log-normal AFT–log-linear | 0.361 | 0.377 | 0.418 | 0.439 | 0.459 | 0.481 | 0.487 | 0.518 | 0.581
β_0 = 1, β_1 = 1, β_2 = 0.01, ν = 2:
IG with log link | 0.813 | 0.951 | 0.973 | 0.995 | 1 | 1 | 1 | 1 | 1
Weibull AFT–log-linear | 0.241 | 0.257 | 0.267 | 0.272 | 0.272 | 0.283 | 0.300 | 0.344 | 0.372
log-logistic AFT–log-linear | 0.583 | 0.651 | 0.705 | 0.780 | 0.838 | 0.923 | 0.974 | 0.986 | 0.994
log-normal AFT–log-linear | 0.415 | 0.444 | 0.489 | 0.513 | 0.578 | 0.653 | 0.709 | 0.768 | 0.823
β_0 = 1, β_1 = 1, β_2 = 0.01, gamma with shape ν:
ν = 1 + 2 z_1 | 0.535 | 0.533 | 0.569 | 0.606 | 0.680 | 0.781 | 0.873 | 0.923 | 0.968
ν = 0.7 + 2 z_1 | 0.545 | 0.630 | 0.651 | 0.703 | 0.772 | 0.910 | 0.954 | 0.987 | 0.994
Table 6. Inverse Gaussian regression. Powers against various alternatives. n: number of observations; k: optimal number of grouping intervals.

Alternative | 30; 5 | 50; 6 | 60; 7 | 80; 8 | 100; 9 | 150; 10 | 200; 11 | 250; 12
β_0 = 1, β_1 = 1, β_2 = 0.01, ν = 2:
gamma with log link | 0.708 | 0.816 | 0.848 | 0.909 | 0.946 | 0.982 | 0.999 | 1
Weibull AFT–log-linear | 0.739 | 0.819 | 0.896 | 0.922 | 0.970 | 0.995 | 0.999 | 1
log-logistic AFT–log-linear | 0.433 | 0.544 | 0.549 | 0.594 | 0.664 | 0.746 | 0.842 | 0.891
log-normal AFT–log-linear | 0.503 | 0.616 | 0.677 | 0.717 | 0.805 | 0.887 | 0.950 | 0.978
β_0 = 1, β_1 = 1, β_2 = 0.01, ν = 1.5:
gamma with log link | 0.761 | 0.895 | 0.903 | 0.959 | 0.987 | 0.996 | 0.999 | 1
Weibull AFT–log-linear | 0.795 | 0.878 | 0.919 | 0.968 | 0.993 | 0.999 | 0.993 | 1
log-logistic AFT–log-linear | 0.575 | 0.616 | 0.665 | 0.736 | 0.774 | 0.849 | 0.920 | 0.966
log-normal AFT–log-linear | 0.475 | 0.504 | 0.534 | 0.617 | 0.681 | 0.802 | 0.872 | 0.939
Table 7. Failure times T_i for 76 electrical insulating fluids tested at voltages v_i.

v_i (kV) | Frequency | Failure Times T_i
26 | 3  | 5.79 159.52 2323.70
28 | 5  | 68.85 108.29 110.29 426.07 1067.60
30 | 11 | 7.74 17.05 20.46 21.02 22.66 43.40 47.30 139.07 144.12 175.88 194.90
32 | 15 | 0.27 0.40 0.69 0.79 2.75 3.91 9.88 13.95 15.93 27.80 53.24 82.85 89.29 100.58 215.10
34 | 19 | 0.19 0.78 0.96 1.31 2.78 3.16 4.15 4.67 4.85 6.50 7.35 8.01 8.27 12.06 31.75 32.52 33.91 36.71 72.89
36 | 15 | 0.35 0.59 0.96 0.99 1.69 1.97 2.07 2.58 2.71 2.90 3.67 3.99 5.35 13.77 25.50
38 | 8  | 0.09 0.39 0.47 0.73 0.74 1.13 1.40 2.38
Table 8. Modified chi-squared test, k = 8. Electrical insulating fluids data.

Model | Y² | p-Value | AIC | BIC
gamma regression; log link | 9.335 | 0.3149 | 604.9 | 611.9
Weibull AFT–power rule model | 6.704 | 0.4604 | 607.6 | 614.6
IG regression; log link | 88.4 | <0.0001 | 651.3 | 658.3
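The AIC and BIC columns follow the standard definitions AIC = 2p − 2 ln L and BIC = p ln n − 2 ln L, where p is the number of estimated parameters, L the maximized likelihood, and n the sample size. A minimal sketch (the numeric inputs below are illustrative, not values from the paper):

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: AIC = 2p - 2 * ln L."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: BIC = p * ln n - 2 * ln L."""
    return n_params * math.log(n_obs) - 2 * loglik

# Illustrative values only: log-likelihood -299.0, 3 parameters, n = 76
print(aic(-299.0, 3))      # 604.0
print(bic(-299.0, 3, 76))  # ≈ 610.99
```

Because both criteria share the −2 ln L term, model rankings differ only through the penalty, with BIC penalizing extra parameters more heavily once n > e² ≈ 7.4.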
Table 9. Modified chi-squared test, k = 9. Hospital cost data.

Model | Y² | p-Value | AIC | BIC
gamma regression; log link | 14.57 | 0.1035 | 1817.9 | 1838.8
Weibull AFT | 8.82 | 0.3574 | 1817.0 | 1837.8
IG regression with log link | 43.42 | <0.0001 | 1866.3 | 1887.1
Table 10. Failure times of bearing specimens.

Compound | Failures
I   | 3.03 5.53 5.60 9.30 9.92 12.51 12.95 15.21 16.04 16.84
II  | 3.19 4.26 4.47 4.53 4.67 4.69 5.78 6.79 9.37 12.75
III | 3.46 5.22 5.69 6.54 9.16 9.40 10.19 10.71 12.58 13.41
IV  | 5.88 6.74 6.90 6.98 7.21 8.14 8.59 9.80 12.28 25.46
V   | 6.43 9.97 10.39 13.55 14.45 14.72 16.81 18.39 20.84 21.51
Table 11. Chi-squared test, k = 6. Bearing specimens data.

Model | Y² | p-Value
gamma regression; log link | 13.38 | 0.0373
Weibull AFT | 13.20 | 0.0216
IG regression with log link | 17.37 | 0.0080
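The reported p-values are consistent with referring Y² to a chi-squared distribution with k degrees of freedom (an assumption of this sketch, not a statement made by the tables themselves). For even df the chi-squared survival function has a closed form, so the gamma-model entries can be checked with the standard library alone:

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function P(X > x) for a chi-squared variable with even df:
    P(X > x) = exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    h = x / 2.0
    return math.exp(-h) * sum(h ** j / math.factorial(j) for j in range(df // 2))

# Gamma-model entries, assuming df = k
print(chi2_sf_even_df(9.335, 8))  # ≈ 0.315; Table 8 reports 0.3149 with k = 8
print(chi2_sf_even_df(13.38, 6))  # ≈ 0.037; Table 11 reports 0.0373 with k = 6
```

The tiny discrepancies in the last digit are explained by the Y² statistics themselves being rounded in the tables.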