1. Introduction
Air pollution poses a significant environmental threat with dire consequences for public health, a concern that is likely to escalate in the coming years. Despite growing awareness, our understanding of the intricate relationship between air pollution and human health remains incomplete. In the present study, we endeavor to bridge this gap by examining and quantifying the relationship between particulate matter with a diameter less than or equal to 2.5 µm (PM2.5), in conjunction with meteorological variables, and the incidence of circulatory system diseases. Empirical evidence of the extremely high PM2.5 concentrations in Thailand’s northeastern and northern regions during certain parts of the year highlights the seriousness of the air pollution situation. Indeed, PM2.5 levels in these regions were four times higher in April 2022 than the World Health Organization’s annual air quality standard, indicating hazardous atmospheric conditions. Forecasting PM2.5 levels presents a formidable challenge due to their inherent variability, with daily PM2.5 data within a given region typically conforming to an inverse Gaussian distribution. Moreover, statistical metrics such as the coefficient of variation can be used to quantify and draw inferences about this variability.
The inverse Gaussian distribution involves two positive parameters, the mean ($\mu$) and the scale ($\lambda$), and it maintains a strong connection with the normal distribution. It has been applied to financial modeling, survival analysis, and reliability theory, among other areas. Its versatility makes it particularly well-suited for analyzing data that deviate from normality. The inverse Gaussian distribution has been utilized across multiple disciplines, including biology (Hsu et al. [1]), pharmacokinetics (Weiss [2]), survival analysis (Khan et al. [3]), demography (Ewbanks [4]), and finance (Balakrishna and Rahul [5], Punzo [6]). Characterized by its ability to model processes with delay and right-skewed data, it provides a suitable framework for analyzing PM2.5 concentrations. The distribution of PM2.5 data tends to be right-skewed, due to occasional high-pollution events, resulting in a longer tail of extreme values on the higher end. This skewness indicates that PM2.5 data are not symmetric. Utilizing the inverse Gaussian distribution to model PM2.5 concentrations allows for more accurate statistical analyses and predictions. It effectively captures the asymmetry and heavy tail of PM2.5 data, providing insights into the likelihood of extreme pollution events and aiding in assessing their impact on public health. While data transformations, such as taking logarithms, can sometimes be applied to achieve symmetry, the inherent characteristics of PM2.5 data make the inverse Gaussian distribution a more effective choice for modeling and analyzing these environmental data.
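To illustrate the right-skewed behavior described above, the following R sketch simulates a year of hypothetical daily concentrations from an inverse Gaussian distribution and compares the skewness of the raw and log-transformed values; it is a minimal illustration only, and the mean and shape values used here (50 and 120) are arbitrary choices rather than estimates from any dataset in this study.

```r
# Minimal sketch: right-skewed inverse Gaussian data (parameter values are illustrative only)
library(statmod)   # provides rinvgauss()

set.seed(1)
x <- rinvgauss(365, mean = 50, shape = 120)   # one year of hypothetical daily concentrations

# Sample skewness: positive values indicate a longer right tail
skewness <- function(z) mean((z - mean(z))^3) / sd(z)^3
c(raw = skewness(x), log_transformed = skewness(log(x)))

# Sample coefficient of variation, compared with the theoretical value sqrt(mean/shape)
c(empirical_cv = sd(x) / mean(x), theoretical_cv = sqrt(50 / 120))
```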
Statistical inference can be carried out through hypothesis testing or parameter estimation. The confidence interval, which comprises the lower and upper limits of an estimate for a parameter, is the most widely utilized interval estimation method. Hsieh [
7] calculated the confidence interval for the inverse Gaussian distribution’s coefficient of variation and used it to examine actual data pertaining to runoff volumes at Jug Bridge, Maryland. Non-informative priors for the confidence interval of the common coefficient of variation across two inverse Gaussian distributions were developed by Kang et al. [
8]. Chankham et al. [
9] expanded upon this concept by offering estimators for the coefficients of variation across various inverse Gaussian distributions.
The coefficient of variation, a unit-free metric, can be used to assess the dispersion within data. The confidence intervals for the coefficient of variation and its many functional derivatives have been estimated by a multitude of authors using various approaches (Pang et al. [
10], Hayter [
11], Nam and Kwon [
12]). Nonetheless, the current study aimed to estimate the confidence interval for the common coefficient of variation of many inverse Gaussian distributions, which had not been done previously. This was accomplished through rigorous analysis and empirical investigation. In previous studies, researchers have devised confidence intervals for the common coefficient of variation of both normal and non-normal distributions. Gupta et al. [
13] calculated the asymptotic variance of the common coefficient of variation of normal distributions and then formulated confidence intervals for it. A method of estimating the common coefficient of variation for multiple zero-inflated lognormal distributions was presented by Tian [
14], which used the idea of the generalized confidence interval (GCI). Behboodian and Jafari [
15] utilized generalized
p-values and the GCI in a similar endeavor. Ng [
16] employed the generalized variables methodology to estimate confidence intervals for the common coefficient of variation across multiple lognormal distributions. Liu and Xu [
17] introduced a technique for constructing confidence interval estimates for the common coefficient of variation across multiple normal populations, using a confidence distribution interval method. In order to estimate the confidence interval for the weighted coefficient of variation in two-parameter exponential distributions, Thangjai and Niwitpong [
18] suggested employing the adjusted method of variance estimates recovery (MOVER) methodology. They compared the effectiveness of this strategy with that of the large-sample method and the generalized confidence interval (GCI) approach. According to their findings, the adjusted MOVER approach is a good fit for positive coefficient of variation values, while the GCI approach provides the best-fitting confidence interval for the weighted coefficient of variation of two-parameter exponential distributions. The adjusted GCI was used by Thangjai et al. [19] in their recent work to estimate the confidence interval for the common coefficient of variation of normal distributions using computational techniques. In a comparative analysis with the GCI and adjusted MOVER methods, the adjusted GCI proved effective with small sample sizes, while the computational approach showed efficacy with larger sample sizes. Enhancements to the computational methodology and the MOVER framework have been made to build upon the work of Thangjai et al. [20], in which the fiducial GCI methodology was found to be the most effective for estimating the confidence interval for the common coefficient of variation of several lognormal distributions. The study by Thangjai et al. [20] was restricted in scope, though, as it only examined positively skewed lognormal distributions.
As mentioned earlier, many researchers have developed confidence intervals for the common coefficient of variation of several normal and non-normal distributions. However, there has not yet been an investigation of statistical inference based on the common coefficient of variation of several inverse Gaussian distributions. In the present study, our primary aim was to estimate the confidence interval for the common coefficient of variation of several inverse Gaussian distributions. We achieved this by employing various methodologies: the GCI, the adjusted MOVER, the Bayesian credible interval (BCI), the highest posterior density interval based on the BCI (HPD.BCI), the fiducial confidence interval (FCI), and the highest posterior density interval based on the FCI (HPD.FCI). We evaluated the effectiveness of these methods through a rigorous analysis of their coverage probabilities and average lengths across various scenarios in a simulation study. We also applied them in a real-world scenario to analyze PM2.5 data from various areas in northeastern Thailand.
2. Methods
Let $X_{i1}, X_{i2}, \ldots, X_{in_i}$, for $i = 1, 2, \ldots, k$, be random samples from $k$ inverse Gaussian distributions denoted as $IG(\mu_i, \lambda_i)$. The distribution function for $X_{ij}$ is defined as
$$f(x_{ij}; \mu_i, \lambda_i) = \sqrt{\frac{\lambda_i}{2\pi x_{ij}^{3}}}\exp\left[-\frac{\lambda_i (x_{ij} - \mu_i)^2}{2\mu_i^2 x_{ij}}\right], \quad x_{ij} > 0,$$
where $\mu_i$ and $\lambda_i$ are the mean and scale parameters, respectively. Following the method of Ye et al. [21], the respective mean and variance of $X_{ij}$ are $E(X_{ij}) = \mu_i$ and $Var(X_{ij}) = \mu_i^{3}/\lambda_i$. Hence, the coefficient of variation of $X_{ij}$ can be represented as
$$CV_i = \frac{\sqrt{Var(X_{ij})}}{E(X_{ij})} = \sqrt{\frac{\mu_i}{\lambda_i}}.$$
For a random sample $X_{i1}, X_{i2}, \ldots, X_{in_i}$ from $IG(\mu_i, \lambda_i)$, we obtain the maximum likelihood estimators
$$\hat{\mu}_i = \bar{X}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij} \quad \text{and} \quad \hat{\lambda}_i^{-1} = V_i = \frac{1}{n_i}\sum_{j=1}^{n_i}\left(\frac{1}{X_{ij}} - \frac{1}{\bar{X}_i}\right).$$
Therefore, $\bar{X}_i \sim IG(\mu_i, n_i\lambda_i)$ and $n_i\lambda_i V_i \sim \chi^2_{n_i-1}$. First, we consider the square of the coefficients of variation, denoted as $CV_i^2 = \mu_i/\lambda_i$. Since the random variables $\bar{X}_i$ and $V_i$ are independent, we can obtain an unbiased estimator for $CV_i^2$ as follows:
$$\widehat{CV_i^2} = \frac{n_i}{n_i-1}\,\bar{X}_i V_i.$$
Using the distributional properties of
and
, we obtain
where
and
We can use
instead and reparametrize the probability density function of
as follows:
Using the moments of the inverse Gaussian distribution from Chhikara and Folks [
22], we obtain
Therefore, the first two moments for later use are given by
The approximately unbiased variance estimate of
is
For the estimator for the common variance (
),
, its weighted average based on
k individual samples can be defined as
where
Accordingly, the common coefficient of variation can be defined as
Now, we derive the methods to estimate the confidence interval for the common coefficient of variation of multiple inverse Gaussian distributions.
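Before deriving the interval methods, the following R sketch illustrates the point estimation step: the individual coefficient-of-variation estimates $\sqrt{\hat{\mu}_i/\hat{\lambda}_i}$ are computed for several simulated samples and then pooled. The sample-size weighting used for the pooled value is an illustrative assumption only; the study's common coefficient of variation uses the weighted-average definition given above, whose exact weights are not reproduced here.

```r
# Minimal sketch: per-sample CV estimates for k inverse Gaussian samples and a pooled value.
# The sample-size weighting below is illustrative only and is not the paper's exact weighting.
library(statmod)

set.seed(2)
samples <- list(rinvgauss(30,  mean = 40, shape = 90),
                rinvgauss(50,  mean = 55, shape = 150),
                rinvgauss(100, mean = 35, shape = 70))

cv_hat <- sapply(samples, function(x) {
  mu_hat     <- mean(x)                          # MLE of the mean
  lambda_hat <- 1 / mean(1 / x - 1 / mu_hat)     # MLE of the scale parameter
  sqrt(mu_hat / lambda_hat)                      # estimated coefficient of variation
})

n <- sapply(samples, length)
common_cv <- sum(n * cv_hat) / sum(n)            # sample-size-weighted pooled estimate (assumption)
common_cv
```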
2.1. The Generalized Confidence Interval (GCI)
Weerahandi [
23] pioneered the development of the generalized pivotal quantity (GPQ) concept and exploited it to provide the framework for the GCI. This approach provides flexibility, as it does not require the assumption of normality, making it well suited to the inherently skewed and asymmetric nature of the inverse Gaussian distribution and thereby ensuring accurate modeling and analysis of such data. Additionally, the GCI method accounts for the uncertainty of multiple parameters simultaneously, leading to more accurate confidence interval estimation.
Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample with density function $f(x; \theta, \delta)$, where $\theta$ is the parameter of interest and $\delta$ is a nuisance parameter. Let $x$ denote the observed value of $X$. The GPQ, $R(X; x, \theta, \delta)$, is required to satisfy the following two properties:
The probability distribution of $R(X; x, \theta, \delta)$ does not depend on the nuisance parameter.
The observed value of $R(X; x, \theta, \delta)$ does not depend on the nuisance parameter.
Given that $R(\gamma)$ is the $100\gamma$-th percentile of $R(X; x, \theta, \delta)$, then $[R(\alpha/2), R(1-\alpha/2)]$ becomes the $100(1-\alpha)\%$ two-sided GCI for $\theta$. Therefore, it is essential to use the GPQs for $\mu_i$ and $\lambda_i$ to estimate the confidence interval for the common coefficient of variation of several inverse Gaussian distributions.
For $k$ individual random samples from inverse Gaussian distributions, the GPQs for $\mu_i$ and $\lambda_i$ can, respectively, be defined, following Ye et al. [21], as
and
where $\bar{x}_i$ and $v_i$ denote the observed values of $\bar{X}_i$ and $V_i$. Accordingly, the GPQ for $CV_i$ becomes
By using Equations (13) and (14), the GPQ for the common coefficient of variation can be formulated as
where
and
is given by
where
. Therefore, the $100(1-\alpha)\%$ two-sided confidence interval for the common coefficient of variation based on the GCI is
$$[L_{GCI}, U_{GCI}] = [R_{CV}(\alpha/2),\, R_{CV}(1-\alpha/2)],$$
where $R_{CV}(\alpha/2)$ and $R_{CV}(1-\alpha/2)$ denote the $(\alpha/2)$-th and $(1-\alpha/2)$-th percentiles of the distribution of $R_{CV}$, respectively. Algorithm 1 delineates the step-by-step computational process for constructing the GCI (an illustrative R sketch follows the algorithm):
Algorithm 1: The GCI method |
Generate $x_{ij}$ from an inverse Gaussian distribution. Calculate $\bar{x}_i$ and $v_i$. Generate chi-square and standard normal random variates. Calculate the GPQs using Equations (15), (16), (17), and (18), respectively. Compute the GPQ for the common coefficient of variation, using Equation (18). Repeat Steps 2–5 5000 times. Compute the confidence interval for the common coefficient of variation based on the GCI.
|
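The following R sketch illustrates Algorithm 1. Because Equations (15)–(18) are not reproduced above, the GPQ forms used here are assumptions: the scale GPQ is taken as a chi-square variate divided by the observed sum of reciprocal deviations, the mean GPQ follows the approximate form often used for inverse Gaussian means, and the individual CV GPQs are pooled with sample-size weights; the paper's exact expressions may differ.

```r
# Minimal sketch of the GCI of Algorithm 1, under assumed GPQ forms (see text above).
library(statmod)

gci_common_cv <- function(samples, B = 5000, conf = 0.95) {
  k    <- length(samples)
  n    <- sapply(samples, length)
  xbar <- sapply(samples, mean)
  d    <- mapply(function(x, xb) sum(1 / x - 1 / xb), samples, xbar)  # observed reciprocal deviations

  R_cv <- replicate(B, {
    r_cv_i <- numeric(k)
    for (i in 1:k) {
      R_lambda <- rchisq(1, df = n[i] - 1) / d[i]                       # assumed GPQ for lambda_i
      Z        <- rnorm(1)
      R_mu     <- xbar[i] / max(1e-10, 1 + Z * sqrt(xbar[i] / (n[i] * R_lambda)))  # assumed GPQ for mu_i
      r_cv_i[i] <- sqrt(R_mu / R_lambda)                                # GPQ for CV_i
    }
    sum(n * r_cv_i) / sum(n)                                            # pooled GPQ (weights assumed)
  })

  quantile(R_cv, probs = c((1 - conf) / 2, (1 + conf) / 2))
}

set.seed(3)
samples <- list(rinvgauss(30, 40, shape = 90), rinvgauss(50, 55, shape = 150))
gci_common_cv(samples)
```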
2.2. The Bayesian Methods
Bayesian inference involves revising initial beliefs in light of fresh evidence, leading to the derivation of a posterior distribution. For random samples $X_{i1}, X_{i2}, \ldots, X_{in_i}$, $i = 1, 2, \ldots, k$, from $IG(\mu_i, \lambda_i)$, the joint likelihood function can be expressed as
$$L(\mu, \lambda \mid x) \propto \prod_{i=1}^{k}\prod_{j=1}^{n_i} \sqrt{\frac{\lambda_i}{2\pi x_{ij}^{3}}}\exp\left[-\frac{\lambda_i (x_{ij} - \mu_i)^2}{2\mu_i^2 x_{ij}}\right].$$
By applying Bayes’ theorem, the posterior distribution is obtained as
$$p(\mu, \lambda \mid x) \propto L(\mu, \lambda \mid x)\, p(\mu)\, p(\lambda),$$
where $p(\mu)$ and $p(\lambda)$ constitute the prior distributions for $\mu$ and $\lambda$, respectively. To formulate the Fisher information matrix for the unknown parameters, we employ the second-order partial derivatives of the log-likelihood function, since the Fisher information matrix captures the curvature of the log-likelihood with respect to the unknown parameters, as follows:
In the ensuing sections, we cover the application of the Jeffreys prior rule to constructing both the BCI and HPD intervals. Within the Bayesian framework, the methodology for the inverse Gaussian distribution hinges significantly on the choice of parameterization [24]. Instead of working directly with the mean, it is more convenient to parameterize in terms of the reciprocal of the mean, $1/\mu_i$, which aids in deriving manageable expressions for both the joint and marginal posterior distributions. The Jeffreys prior rule can be used to generate the posterior distribution when both parameters are unknown, thereby eliminating the need to specify a subjective prior. Although choosing a natural conjugate prior is a viable alternative, it presents challenges in selecting hyperparameter values, which can introduce bias into the inference. By using the Jeffreys prior rule, the marginal posterior distributions for both $\mu_i$ and $\lambda_i$ can, respectively, be derived as
and
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution, and $\hat{\mu}_i$ and $\hat{\lambda}_i$ are the maximum likelihood estimators for $\mu_i$ and $\lambda_i$, respectively, computed from all of the observations in the $i$-th sample. In the present work, we assume that both $\mu_i$ and $\lambda_i$ are unknown. Utilizing the Markov chain Monte Carlo (MCMC) technique, Gibbs sampling was employed to determine the posterior and fiducial distributions of the parameters (Gelfand and Smith [25]). A popular strategy in Bayesian methodology is to sample from the posterior distribution by iterating over each variable in turn and sampling from its conditional distribution while keeping the other variables fixed. Convergence of the Gibbs sampler was verified using both numerical and graphical summaries. Through iteration, the sampler progressively refines the samples, ultimately yielding a representative approximation of the posterior distribution. This approach offers a robust means of inference, particularly in complex Bayesian models where direct sampling is infeasible. After generating the BCIs and HPD intervals for the common coefficient of variation of multiple inverse Gaussian distributions by substituting the posterior densities of $\mu_i$ and $\lambda_i$
in Equations (4), (5), (12), and (14), the
$100(1-\alpha)\%$ two-sided confidence interval for the common coefficient of variation based on the BCI method becomes
$$[L_{BCI}, U_{BCI}],$$
where $L_{BCI}$ and $U_{BCI}$ are the lower and upper bounds of the $100(1-\alpha)\%$ equal-tailed credible interval and, analogously, of the HPD interval of the common coefficient of variation, respectively. The highest posterior density (HPD) interval is the shortest interval within the HPD region; every value inside this region has a higher probability density than any value outside it. Consequently, the HPD interval is a useful summary in Bayesian statistics, concisely describing the most probable values of the parameter given the data and the model.
2.3. The Fiducial Confidence Interval (FCI)
Due to the pioneering work by Fisher [
26], fiducial inference has emerged as a pivotal concept that departs significantly from conventional statistical methods. In fiducial inference, parameters are regarded as random variables, and their distributions (called fiducial distributions) are based only on the observed data, without reference to any prior distributions. Fiducial intervals can be interpreted similarly to Bayesian credible intervals, thereby providing a direct probabilistic interpretation of parameter estimates, which can be more intuitive for practitioners who prefer to understand uncertainty in terms of probability. The FCI blends frequentist and Bayesian features, providing an interpretable, flexible, and computationally efficient method for parameter estimation and uncertainty quantification; it is particularly beneficial in settings where traditional methods are challenged or where prior information is unavailable or undesirable. In addition, random samples are produced by utilizing point and interval estimates of the unknown parameters along with maximum likelihood estimation. Despite its complexity, the fiducial method can be applied to the parameters of the inverse Gaussian distribution, especially when coupled with MCMC, in the following manner:
and
where $\hat{\mu}_i$ and $\hat{\lambda}_i$ are the maximum likelihood estimators for $\mu_i$ and $\lambda_i$, respectively.
In the present study, we employed the Gibbs sampler technique to draw samples from the fiducial distribution. In addition, we implemented an analogous procedure for estimating the fiducial intervals, wherein the Bayesian posterior distribution was replaced by the fiducial distribution. Thus, by inserting the fiducial densities of $\mu_i$ and $\lambda_i$ into Equations (4), (5), (12), and (14), we were able to apply the FCI to estimate the confidence interval for the common coefficient of variation of multiple inverse Gaussian distributions.
Therefore, the $100(1-\alpha)\%$ two-sided confidence interval for the common coefficient of variation based on the FCI method becomes
$$[L_{FCI}, U_{FCI}],$$
where $L_{FCI}$ and $U_{FCI}$ are the lower and upper bounds of the $100(1-\alpha)\%$ equal-tailed FCI and, analogously, of the HPD interval of the common coefficient of variation, respectively.
The confidence interval for the common coefficient of variation based on the BCI, HPD.BCI, FCI, and HPD.FCI methods can be estimated by applying Algorithm 2 (an illustrative R sketch follows the algorithm):
Algorithm 2: The BCI, HPD.BCI, FCI, and HPD.FCI methods |
Generate $x_{ij}$ from an inverse Gaussian distribution. Compute the MLEs of $\mu_i$ and $\lambda_i$ and use them as initial values. Generate $\mu_i$ and $\lambda_i$ from their respective posterior (or fiducial) distributions, as detailed in Equations (24) and (25), utilizing the updated sample observations. Repeat Steps 2 and 3 for $t$ iterations, where $t$ denotes the number of MCMC replications. Calculate the desired parameters after discarding the first 1000 samples as burn-in. Compute the confidence interval for the common coefficient of variation based on the BCI and FCI. Compute the HPD.BCI and HPD.FCI, using the R software package version 4.2.2.
|
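To make the interval-extraction step in Algorithm 2 concrete, the sketch below shows how an equal-tailed credible (or fiducial) interval and an HPD interval can be obtained in R from a vector of posterior or fiducial draws of the common coefficient of variation. The draws themselves are simulated placeholders here, since the posterior and fiducial densities of Equations (24) and (25) are not reproduced above.

```r
# Minimal sketch: turning MCMC draws of the common CV into BCI/FCI and HPD intervals.
# 'draws' is a placeholder for the post-burn-in Gibbs output of Algorithm 2.
library(coda)

set.seed(4)
draws <- rgamma(9000, shape = 50, rate = 100)     # placeholder draws of the common CV

# Equal-tailed 95% interval (used for the BCI and FCI)
equal_tailed <- quantile(draws, probs = c(0.025, 0.975))

# Highest posterior density interval (used for HPD.BCI and HPD.FCI)
hpd <- HPDinterval(as.mcmc(draws), prob = 0.95)

equal_tailed
hpd
```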
2.4. Adjusted MOVER
We use the large sample technique to compute the adjusted MOVER, building on the foundation for MOVER first proposed by Donner and Zou [
27]. Again, Equation (
13) can be used to define the aggregated estimator for the common mean. For the two parameters of interest, $\theta_1$ and $\theta_2$, we can employ their independent estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$, to define the lower limit $L$ and the upper limit $U$ for $\theta_1 + \theta_2$ as
$$L = \hat{\theta}_1 + \hat{\theta}_2 - z_{1-\alpha/2}\sqrt{\widehat{Var}(\hat{\theta}_1) + \widehat{Var}(\hat{\theta}_2)} \quad \text{and} \quad U = \hat{\theta}_1 + \hat{\theta}_2 + z_{1-\alpha/2}\sqrt{\widehat{Var}(\hat{\theta}_1) + \widehat{Var}(\hat{\theta}_2)},$$
where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th percentile of the standard normal distribution. Through the application of the central limit theorem, the variance estimates for $\hat{\theta}_1$ and $\hat{\theta}_2$ at $\theta_1 = l_1$ and $\theta_2 = l_2$ can be, respectively, defined as
$$\widehat{Var}(\hat{\theta}_1) = \frac{(\hat{\theta}_1 - l_1)^2}{z_{1-\alpha/2}^2} \quad \text{and} \quad \widehat{Var}(\hat{\theta}_2) = \frac{(\hat{\theta}_2 - l_2)^2}{z_{1-\alpha/2}^2},$$
where $l_1$ and $l_2$ are the lower limits of $\theta_1$ and $\theta_2$, respectively. Furthermore, the variance estimates for $\hat{\theta}_1$ and $\hat{\theta}_2$ at $\theta_1 = u_1$ and $\theta_2 = u_2$ can be, respectively, defined as
$$\widehat{Var}(\hat{\theta}_1) = \frac{(u_1 - \hat{\theta}_1)^2}{z_{1-\alpha/2}^2} \quad \text{and} \quad \widehat{Var}(\hat{\theta}_2) = \frac{(u_2 - \hat{\theta}_2)^2}{z_{1-\alpha/2}^2},$$
where $u_1$ and $u_2$ are the upper limits of $\theta_1$ and $\theta_2$, respectively. Based on Equation (30), the $100(1-\alpha)\%$ confidence limits for $\theta_1 + \theta_2$ can be expressed as
$$L = \hat{\theta}_1 + \hat{\theta}_2 - \sqrt{(\hat{\theta}_1 - l_1)^2 + (\hat{\theta}_2 - l_2)^2} \quad \text{and} \quad U = \hat{\theta}_1 + \hat{\theta}_2 + \sqrt{(u_1 - \hat{\theta}_1)^2 + (u_2 - \hat{\theta}_2)^2}.$$
For $k$ independent samples to which the adjusted MOVER is applied, $L$ and $U$ for the sum of the parameters $\theta_1, \theta_2, \ldots, \theta_k$ can be written as
$$L = \sum_{i=1}^{k}\hat{\theta}_i - \sqrt{\sum_{i=1}^{k}\left(\hat{\theta}_i - l_i\right)^2} \quad \text{and} \quad U = \sum_{i=1}^{k}\hat{\theta}_i + \sqrt{\sum_{i=1}^{k}\left(u_i - \hat{\theta}_i\right)^2}.$$
The variance estimates of $\hat{\theta}_i$ at $\theta_i = l_i$ and $\theta_i = u_i$, where $i = 1, 2, \ldots, k$, are provided by
$$\widehat{Var}(\hat{\theta}_i)\Big|_{\theta_i = l_i} = \frac{(\hat{\theta}_i - l_i)^2}{z_{1-\alpha/2}^2} \quad \text{and} \quad \widehat{Var}(\hat{\theta}_i)\Big|_{\theta_i = u_i} = \frac{(u_i - \hat{\theta}_i)^2}{z_{1-\alpha/2}^2}.$$
In this study, the lower and upper bounds of
were established using the Wald confidence interval as follows:
When utilizing the large sample concept for the interval estimation of
, the variance estimate for
can be defined as
Therefore, the $100(1-\alpha)\%$ two-sided confidence interval for the common coefficient of variation, using the adjusted MOVER with the Wald confidence interval, becomes
where
and
where
is defined as in Equation (
6). The confidence interval derived from the adjusted MOVER method can be readily constructed using Algorithm 3 (an illustrative R sketch follows the algorithm).
Algorithm 3: The adjusted MOVER method |
Generate $x_{ij}$ from an inverse Gaussian distribution. Compute $\bar{x}_i$ and $v_i$. Compute the individual estimates and their lower and upper limits $l_i$ and $u_i$. Compute $L$ and $U$. Compute the confidence interval for the common coefficient of variation.
|
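The following R sketch illustrates the MOVER combination step of Algorithm 3 in the spirit of Donner and Zou [27]. The individual limits $l_i$ and $u_i$ are taken here from simple Wald intervals for the per-sample CV estimates with a rough delta-method variance approximation, and the estimates are combined as a sample-size-weighted average; both choices are illustrative assumptions rather than the paper's exact equations.

```r
# Minimal sketch of the adjusted MOVER combination (Algorithm 3), following the
# Donner-Zou variance-recovery idea. The Wald standard error used for the individual
# CVs and the sample-size weights are illustrative assumptions.
library(statmod)

mover_common_cv <- function(samples, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)
  n  <- sapply(samples, length)
  cv <- sapply(samples, function(x) {
    mu <- mean(x); lambda <- 1 / mean(1 / x - 1 / mu)
    sqrt(mu / lambda)
  })

  se <- sqrt(cv^2 * (2 + cv^2) / (4 * n))    # rough delta-method standard error (assumption)
  l  <- cv - z * se                          # individual Wald lower limits
  u  <- cv + z * se                          # individual Wald upper limits

  w   <- n / sum(n)                          # illustrative weights
  est <- sum(w * cv)
  L <- est - sqrt(sum(w^2 * (cv - l)^2))     # MOVER lower limit for the weighted combination
  U <- est + sqrt(sum(w^2 * (u - cv)^2))     # MOVER upper limit
  c(estimate = est, lower = L, upper = U)
}

set.seed(5)
samples <- list(rinvgauss(30, 40, shape = 90), rinvgauss(50, 55, shape = 150),
                rinvgauss(100, 35, shape = 70))
mover_common_cv(samples)
```

With symmetric Wald limits the MOVER result coincides with a plain large-sample interval for the weighted sum; the recovery step matters when the individual intervals are asymmetric, as they typically are for coefficients of variation.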
3. The Simulation Study and Results
We employed the R statistical software in conjunction with Monte Carlo simulation techniques to calculate the coverage probabilities and average lengths of the different confidence interval estimation methods: GCI, BCI, HPD.BCI, FCI, HPD.FCI, and the adjusted MOVER. The most effective method for a given scenario achieved a coverage probability that met or exceeded the nominal confidence level of 0.95 and had the shortest average length. In each simulation, 10,000 random samples from an inverse Gaussian distribution, along with 5000 pivotal quantities for the GCI, BCI, and FCI methods, were generated. The number of populations was k = 3 or 5. The sample sizes were or 100, or 7, and or
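The structure of the coverage evaluation can be sketched as follows in R; the parameter settings and the generic interval_fn argument are placeholders, and in the actual study each of the six methods (for example, the GCI function sketched in Section 2.1) would be passed in turn.

```r
# Minimal sketch of the Monte Carlo evaluation: estimate coverage probability and
# average length for one interval-estimation method. Parameter values are placeholders.
library(statmod)

evaluate_method <- function(interval_fn, n = c(30, 50, 100), mu = c(40, 55, 35),
                            lambda = c(90, 150, 70), true_cv, M = 1000, conf = 0.95) {
  hits <- lengths_ <- numeric(M)
  for (m in 1:M) {
    samples <- mapply(function(ni, mi, li) rinvgauss(ni, mean = mi, shape = li),
                      n, mu, lambda, SIMPLIFY = FALSE)
    ci <- interval_fn(samples)                       # must return c(lower, upper)
    hits[m]     <- (ci[1] <= true_cv) && (true_cv <= ci[2])
    lengths_[m] <- ci[2] - ci[1]
  }
  c(coverage = mean(hits), average_length = mean(lengths_))
}

# Example usage with the GCI sketch from Section 2.1 (the "true" common CV is taken as a
# sample-size-weighted average of the individual CVs, matching that sketch's assumed weights):
# true_cv <- sum(c(30, 50, 100) * sqrt(c(40, 55, 35) / c(90, 150, 70))) / sum(c(30, 50, 100))
# evaluate_method(function(s) gci_common_cv(s, B = 1000), true_cv = true_cv, M = 200)
```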
Plots of the coverage probabilities and average lengths for the six confidence interval estimation methods across various sample sizes are provided in
Figure 1,
Figure 2,
Figure 3 and
Figure 4. The simulation results for k = 3 are reported in
Table 1. It can be seen that the coverage probabilities for the GCI were greater than or close to the nominal confidence level of 0.95 in most scenarios. One noteworthy result was that the FCI performed well in situations where the sample sizes of the groups were unequal. However, the coverage probabilities for the BCI, HPD.BCI, HPD.FCI, and the adjusted MOVER were below the 0.95 threshold in all of the scenarios. As the sample sizes were increased, the coverage probabilities for the BCI, HPD.BCI, HPD.FCI, and the adjusted MOVER improved but remained under the nominal confidence level of 0.95. When examining the average lengths, the adjusted MOVER typically provided the shortest intervals, with the BCI and HPD.BCI following closely. From the findings for k = 5 in
Table 2, it can be observed that the GCI method provided coverage probabilities of at least 0.95 only when the sample size was 100. The other methods yielded results similar to those for k = 3. When considering the average lengths for both k = 3 and 5, it was found that an increase in sample size and scale resulted in shorter average lengths in all cases. Overall, the GCI outperformed the other methods across the various simulation scenarios by meeting the criteria for both efficiency and accuracy.
4. An Empirical Example with Real Data
For this part of the study, we used daily PM2.5
data from October to December 2023 from the Nakhon Ratchasima, Nong Khai, and Ubon Ratchathani provinces in northeastern Thailand (
Table 3 [
28]). The Q–Q plots in
Figure 5,
Figure 6 and
Figure 7 illustrate that the positive data conformed to inverse Gaussian distributions, as also evidenced by the lowest Akaike information criterion (AIC) and Bayesian information criterion (BIC) values in
Table 4 and
Table 5, respectively. We first utilized the Kolmogorov–Smirnov (KS) test to determine whether the PM2.5 data from the three provinces followed inverse Gaussian distributions. This test evaluates whether the data adhere to an inverse Gaussian distribution by comparing the p-value with a particular significance level, commonly set at 0.05. If the p-value is below this threshold, the null hypothesis is rejected, indicating that the data do not follow the specified distribution. For our analysis, the KS test produced p-values of 0.2172 for Nakhon Ratchasima, 0.2146 for Nong Khai, and 0.2812 for Ubon Ratchathani. Since all of these p-values were above the 0.05 significance level, we concluded that the PM2.5 data from these provinces fitted the inverse Gaussian distribution model.
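The distribution check described above can be reproduced in outline with the following R sketch; the pm25 vector is a hypothetical placeholder, since the provincial datasets themselves are not reproduced here, and the KS test is applied with the maximum likelihood estimates plugged in, as in the analysis above.

```r
# Minimal sketch: checking the inverse Gaussian fit of a PM2.5 series.
# 'pm25' is a hypothetical stand-in for one province's daily October-December values.
library(statmod)

set.seed(6)
pm25 <- rinvgauss(92, mean = 30, shape = 60)

mu_hat     <- mean(pm25)
lambda_hat <- 1 / mean(1 / pm25 - 1 / mu_hat)

# Kolmogorov-Smirnov test against the fitted inverse Gaussian distribution
ks.test(pm25, pinvgauss, mean = mu_hat, shape = lambda_hat)

# Q-Q plot of empirical quantiles against fitted inverse Gaussian quantiles
qqplot(qinvgauss(ppoints(length(pm25)), mean = mu_hat, shape = lambda_hat), sort(pm25),
       xlab = "Theoretical quantiles", ylab = "Sample quantiles")
```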
Table 6 provides the summary statistics derived from the three PM2.5 datasets. The common coefficient of variation for these three datasets was calculated as 0.4732. We subsequently employed the various methods detailed herein to estimate the 95% confidence interval for the common coefficient of variation of these three inverse Gaussian distributions, as reported in
Table 7. Similar to our simulation findings for k = 3, we found that the GCI provided a coverage probability close to the nominal confidence level of 0.95 and the shortest average length.
5. Discussion
Chankham et al. [
29] proposed the simultaneous confidence interval for the ratios of the coefficients of variation of multiple inverse Gaussian distributions and its application to
PM2.5 data. We extended this idea to construct estimators for the confidence interval for the common coefficient of variation of several inverse Gaussian distributions. In our case, the findings from the simulation study imply that the GCI method is superior to the FCI, HPD.FCI, BCI, HPD.BCI, and adjusted MOVER methods in almost all cases. However, the FCI method also performed well, particularly when the sample sizes were unequal, by providing coverage probabilities that met or exceeded the nominal confidence level. The GCI method, while not always providing the shortest interval, balanced precision and reliability well, offering a reasonably short interval with a high coverage probability. The Bayesian methods were not the most effective, likely due to the prior configuration based on the Jeffreys rule. Moreover, other algorithms, such as Lindley's approximation, could be used to address the issue of a non-closed-form posterior distribution. Inverse Gaussian distributions are inherently asymmetrical, and this asymmetry affects the performance of any applied statistical method. Transforming data from an inverse Gaussian distribution toward normality can potentially make the data more symmetric, depending on the transformation method and the nature of the data [30]. Common transformations used to normalize data include the Box–Cox transformation and the Yeo–Johnson transformation. Nevertheless, understanding the asymmetry of inverse Gaussian data remains crucial for choosing effective statistical methods.
The simulation results were supported by those for the real-world example of analyzing PM2.5 levels in northeastern Thailand; data from the Nakhon Ratchasima, Nong Khai, and Ubon Ratchathani provinces were examined using the inverse Gaussian distribution to assess the variability in PM2.5 levels in these provinces. Once again, the GCI method proved to be the most effective by providing the most accurate confidence interval with the shortest average length, thus confirming its precision and reliability in the context of PM2.5 data analysis. These findings have practical implications for public health policies and pollution control strategies. Accurate statistical modeling of PM2.5 levels could enhance the capability of forecasting pollution episodes, allowing for timely public health warnings and better air quality management. Understanding and managing PM2.5 is crucial, due to its significant health and environmental impacts, and statistical methods provide valuable tools for analyzing and interpreting PM2.5 data, leading to informed decisions and policies aimed at reducing pollution and protecting public health.
6. Conclusions
In this paper, we employed the GCI, BCI, HPD.BCI, FCI, HPD.FCI, and adjusted MOVER methods to estimate the confidence interval for the common coefficient of variation of multiple inverse Gaussian distributions. We evaluated their performance across various simulation scenarios, using coverage probability and average length metrics. The simulation study revealed that the coverage probabilities of the GCI, FCI, and HPD.FCI methods met or exceeded the nominal confidence level of 0.95. Notably, the GCI method emerged as the most suitable approach for both k = 3 and 5. As an empirical example, PM2.5 datasets from the Nakhon Ratchasima, Nong Khai, and Ubon Ratchathani provinces in northeastern Thailand were utilized to assess the efficacy of the various methods. Once again, the GCI method outperformed the others by yielding the confidence interval with the shortest length, which was consistent with the simulation results for k = 3. Hence, the GCI method is recommended for estimating the confidence interval for the common coefficient of variation of multiple inverse Gaussian distributions, with the FCI and HPD.FCI methods also being suitable under certain circumstances.
The GCI method was chosen due to its flexibility and suitability for skewed distributions, making it an appropriate choice for inverse Gaussian distributions. Its inclusion was further supported by extensive documentation and widespread recognition in statistical inference. The adjusted MOVER method is known for providing robust interval estimates even with small sample sizes, enhancing its practical applicability and reinforcing its acceptance in the statistical literature. The BCI method leverages Bayesian inference to incorporate prior information, producing credible intervals that reflect posterior distributions; the distinct methodological foundation of the Bayesian framework justifies its inclusion. The FCI method combines elements of both frequentist and Bayesian inference, providing a unique approach to interval estimation, and its theoretical appeal makes it a compelling choice for comparative analysis. However, while these four classes of methods are robust and well recognized, a more exhaustive comparative study is necessary to enhance the contributions. The inclusion of additional methods, such as bootstrap methods, which provide insights into the robustness and performance of interval estimation in various scenarios, and percentile intervals, which offer a non-parametric alternative to the selected methods, could enrich the analysis. Exploring different Bayesian methods with varying priors would also add depth to the study.
We used various approaches to estimate the confidence interval for the common coefficient of variation of several inverse Gaussian distributions. The findings from an analysis of PM2.5 concentrations from three pollution-monitoring stations in northeastern Thailand aligned well with the outcomes of the simulation study, with the GCI method performing the best in most scenarios, thereby confirming the validity of our approach. Our research will be extended in the future to determine simultaneous confidence intervals for the differences between the percentiles of multiple inverse Gaussian distributions.