Article

Bayesian Forecasting of Bounded Poisson Distributed Time Series

Department of Statistics, Feng Chia University, Taichung 40724, Taiwan
*
Author to whom correspondence should be addressed.
Entropy 2024, 26(1), 16; https://doi.org/10.3390/e26010016
Submission received: 21 November 2023 / Revised: 15 December 2023 / Accepted: 20 December 2023 / Published: 22 December 2023

Abstract

This research models and forecasts bounded ordinal time series data that can appear in various contexts, such as air quality index (AQI) levels, economic situations, and credit ratings. This class of time series data is characterized by being bounded and exhibiting a concentration of large probabilities on a few categories, such as states 0 and 1. We propose using Bayesian methods for modeling and forecasting in zero-one-inflated bounded Poisson autoregressive (ZOBPAR) models, which are specifically designed to capture the dynamic changes in such ordinal time series data. We extend these models to incorporate exogenous variables, which marks a new direction in Bayesian inference and forecasting. Simulation studies demonstrate that the proposed methods accurately estimate all unknown parameters, and the posterior means of parameter estimates are robustly close to the actual values as the sample size increases. In the empirical study, we investigate three datasets of daily AQI levels from three stations in Taiwan and consider five competing models for the real examples. The results show that the proposed method reasonably predicts the AQI levels in the testing period, especially for the Miaoli station.

1. Introduction

Time series of counts are non-negative integer chronological data that are widely investigated in various fields, such as epidemiology, economics, meteorology, and crime. These applications follow Zeger [1] who presented a log-linear regression model to analyze the time series of polio cases in the United States. The literature commonly uses Poisson or binomial regression models to accommodate the characteristic of integer-valued data, but such models are no longer suitable once we deal with time series datasets. Zhu et al. [2] proposed a mixture integer-valued autoregressive conditional heteroscedastic (INARCH) model to deal with computer-aided dispatch calls data. After Ferland et al. [3] proposed the integer-valued generalized autoregressive conditional heteroscedastic (INGARCH) model for time series of counts, numerous extensions and applications of INGARCH models have emerged. Notable examples include those in Weiß [4], Fokianos et al. [5], Zhu [6], Chen and Lee [7], and Chen and Lee [8]. Specifically, Xu et al. [9] introduced a new dispersed INARCH model to examine the time series of dengue cases in Singapore. Chen and Lee [8] investigated the causal relationship between climate and criminal behavior using INGARCHX models to reflect one or more exogenous series. Moreover, Chen and Khamthong [10] proposed nonlinear INGARCHX models for weekly dengue case counts in Thailand, incorporating two climatological covariates: temperature and precipitation.
Many time series of counts contain a large number of zeros, for example when dealing with rare diseases or rare events, which results in overdispersion. Therefore, a large strand of the literature discusses the zero-inflation approach to capturing such phenomena. Lambert [11] presented the zero-inflated Poisson (ZIP) model for count data in which a large number of zero counts are observed per unit of time. The ZIP model is a mixture of a Poisson distribution and a point mass at zero, and it is commonly used in quality control and accident analysis (e.g., Wang [12], Yau et al. [13], and Jazi et al. [14]).
There is another type of integer-valued time series, called categorical time series. Forecasting bounded ordinal time series data is challenging due to their discrete and constrained nature. Bounded ordinal data refer to data with a natural order, but limited to a specific range, such as survey responses on a scale from 1 to $K$, where $K > 2$. One example is the air quality index (AQI), in which the AQI value is divided into six levels, and the data are mostly between levels 1 and 2. Related research on AQI by Chen and Chiu [15] and Liu et al. [16] indicates that AQI data belong to bounded time series data. The time series of AQI levels are ordinal, because the class increases as the AQI interval value increases.
Unlike the classical Poisson distribution and the zero-inflated Poisson (ZIP) distribution, the zero-one-inflated bounded (ZOB) Poisson distribution, as proposed by Liu et al. [16], is finite across states $0, 1, \ldots, K$. This makes it more suitable for fitting restricted states to match the levels of the data. Additionally, Weiß and Jahn [17] introduced soft-clipping binomial INGARCH models as time series models for bounded counts, which also accommodate negative autocorrelations. Liu et al. [18] presented a review of the developments in INGARCH models over the past five years, focusing on unbounded and bounded non-negative counts.
The ZOB Poisson distribution is bounded and primarily concentrated on states 0 and 1, which occur with larger probabilities than the other categories. The ZOB Poisson distribution used to study normalcy-dominant ordinal time series is as follows:
$$P(Y = k) = \pi_1 I\{k = 0\} + \pi_2 I\{k = 1\} + (1 - \pi_1 - \pi_2)\,\frac{\lambda^{k}/k!}{\sum_{i=0}^{K} \lambda^{i}/i!}, \quad k = 0, 1, \ldots, K, \tag{1}$$
where $\pi_1 \ge 0$ and $\pi_2 \ge 0$ are the inflation parameters for states 0 and 1, subject to the constraint $\pi_1 + \pi_2 < 1$; $\lambda > 0$ is the intensity parameter; and the integer $K \ge 2$ is a given upper bound.
In the ZOB Poisson distribution, the case $(\pi_1, \pi_2) = (0, 0)$ is called a bounded Poisson distribution. When $(\pi_1, \pi_2) \neq (0, 0)$, inflation may occur in state 0 or state 1, which allows the distribution to fit a normalcy-dominant ordinal time series with two possible normal states, 0 and 1. In the ZOB Poisson autoregressive (ZOBPAR) model, the intensity function $\lambda_t$ adopts an autoregressive structure so that $\lambda_t$ varies with time.
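The probability mass function in Equation (1) can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `zob_poisson_pmf` and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def zob_poisson_pmf(lam, pi1, pi2, K):
    """Return [P(Y=0), ..., P(Y=K)] following Equation (1)."""
    if not (pi1 >= 0 and pi2 >= 0 and pi1 + pi2 < 1 and lam > 0 and K >= 2):
        raise ValueError("parameters outside the admissible region")
    kernel = [lam ** k / math.factorial(k) for k in range(K + 1)]
    g = sum(kernel)  # normalising constant: sum_{i=0}^{K} lam^i / i!
    pmf = [(1 - pi1 - pi2) * w / g for w in kernel]
    pmf[0] += pi1  # extra mass on state 0
    pmf[1] += pi2  # extra mass on state 1
    return pmf

# Illustrative (hypothetical) parameter values.
probs = zob_poisson_pmf(lam=1.5, pi1=0.01, pi2=0.3, K=4)
bounded = zob_poisson_pmf(lam=1.5, pi1=0.0, pi2=0.0, K=4)  # bounded Poisson case
print([round(p, 4) for p in probs])
```

Setting both inflation parameters to zero recovers the bounded Poisson distribution, so the same function covers both cases.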
Instead of employing the method of maximum likelihood estimator as in Liu et al. [16], the present study uses Bayesian inference with the Markov chain Monte Carlo (MCMC) method to estimate unknown parameters. The advantages of using Bayesian methods include: (1) allowing the incorporation of prior knowledge or beliefs to form a prior distribution that more accurately describes the uncertainty of parameter estimates; (2) enabling simultaneous analysis of all unknown parameters and forecasting; and (3) providing probabilities that parameters fall within credible intervals, which offer a more intuitive and direct way to understand and communicate the uncertainty of parameter estimates.
This study makes three contributions to the existing literature. (1) We incorporate exogenous variables to develop the ZOBPARX model, thus accommodating more flexible situations. (2) We employ Bayesian parameter estimation methods for quantifying uncertainty. (3) We predict one-step-ahead categories for out-of-sample forecasts. The aim is to demonstrate that the model from the ZOBPARX family can effectively capture the dynamic relationships in ordinal data and provide reasonable predictions for the out-of-sample (test) period. To our knowledge, no Bayesian approach or forecast evaluation is currently available for this proposed model.
This paper proceeds as follows. Section 2 reviews the methodologies, the MCMC sampling scheme, and forecasting method. Section 3 explains the results of simulation studies and the accuracy of estimates. Section 4 provides an empirical study of AQI-level forecasts and evaluates forecast accuracy. Finally, Section 5 offers concluding remarks.

2. ZOBPAR and ZOBPARX Models

We denote the bounded Poisson distribution by $P^{*}(\lambda, K)$, where $\lambda$ is the mean of the underlying Poisson distribution, and $K$ is the upper bound of the categories. The bounded Poisson distribution equals the ZOB Poisson distribution in Equation (1) with $(\pi_1, \pi_2) = (0, 0)$. Let $\{D_t\}$ be an independent and identically distributed (i.i.d.) sequence having the following probability distribution:
$$P(D_t = 0) = \pi_1, \quad P(D_t = 1) = \pi_2, \quad P(D_t = 2) = 1 - \pi_1 - \pi_2. \tag{2}$$
From the definition by Liu et al. [16], the ordinal time series $Y_t$ is then said to follow a ZOBPAR model if:
$$Y_t = (2 - D_t) D_t + (D_t - 1) D_t W_t / 2,$$
with $W_t \mid \mathcal{F}_{t-1} \sim P^{*}(\lambda_t, K)$ and:
$$\lambda_t = \alpha_0 + \alpha_1 Y_{t-1} + \beta_1 \lambda_{t-1}, \tag{3}$$
where $\alpha_0 > 0$, $\alpha_1 > 0$, $\beta_1 \ge 0$, $\mathcal{F}_t$ is the information available up to time $t$, and $D_t$ satisfying Equation (2) is independent of $W_t$. To achieve stationarity in Equation (3), Liu et al. [16] provide the following sufficient condition:
$$\beta_1 + K (1 - \pi_1 - \pi_2)\, \alpha_1 / 4 < 1.$$
Note that Equation (3) is similar to the INGARCH(1,1) model of Weiß [19] for capturing the dynamic structure of $\lambda_t$.
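To make the construction concrete, the sketch below simulates a ZOBPAR series by drawing $D_t$ and $W_t$ separately and combining them as in the definition above. It is an illustration under stated assumptions: the function names, the start-up values for $\lambda$ and $Y$, and the seed are hypothetical choices, and the parameter values are only examples.

```python
import math
import random

def draw_bounded_poisson(lam, K, rng):
    """Draw W ~ P*(lam, K) by inverting the CDF on {0, ..., K}."""
    weights = [lam ** k / math.factorial(k) for k in range(K + 1)]
    u = rng.random() * sum(weights)
    cum = 0.0
    for k, w in enumerate(weights):
        cum += w
        if u <= cum:
            return k
    return K  # guard against floating-point round-off

def simulate_zobpar(n, pi1, pi2, a0, a1, b1, K, seed=0):
    """Simulate Y_1, ..., Y_n from the ZOBPAR model."""
    if b1 + K * (1 - pi1 - pi2) * a1 / 4 >= 1:
        raise ValueError("sufficient stationarity condition is violated")
    rng = random.Random(seed)
    lam, y_prev = a0 / (1 - b1), 0  # assumed start-up values
    ys = []
    for _ in range(n):
        lam = a0 + a1 * y_prev + b1 * lam            # intensity recursion
        u = rng.random()
        d = 0 if u < pi1 else (1 if u < pi1 + pi2 else 2)  # D_t
        w = draw_bounded_poisson(lam, K, rng)        # W_t, independent of D_t
        y = (2 - d) * d + (d - 1) * d * w // 2       # Y_t = 0, 1, or W_t
        ys.append(y)
        y_prev = y
    return ys

series = simulate_zobpar(n=500, pi1=0.01, pi2=0.3, a0=0.02, a1=0.7, b1=0.2, K=4)
print(series[:20])
```

The selector $D_t$ returns state 0 or state 1 outright, and passes through the bounded Poisson draw only when $D_t = 2$, which is how the inflation at the two normal states arises.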
When $Y_t$ follows the ZOBPAR model, we can derive the conditional probability of $Y_t$ as:
$$P(Y_t = k \mid \mathcal{F}_{t-1}) = \pi_1 I\{k = 0\} + \pi_2 I\{k = 1\} + (1 - \pi_1 - \pi_2)\,\frac{\lambda_t^{k}/k!}{\sum_{i=0}^{K} \lambda_t^{i}/i!}, \tag{4}$$
where $k = 0, 1, \ldots, K$, and its conditional mean and variance are:
$$E(Y_t \mid \mathcal{F}_{t-1}) = \pi_2 + (1 - \pi_1 - \pi_2)\,\frac{g'(\lambda_t)}{g(\lambda_t)}\,\lambda_t,$$
$$Var(Y_t \mid \mathcal{F}_{t-1}) = \pi_2 (1 - \pi_2) + (1 - \pi_1 - \pi_2)(1 - 2\pi_2)\,\frac{g'(\lambda_t)}{g(\lambda_t)}\,\lambda_t + (1 - \pi_1 - \pi_2)\left[\frac{g''(\lambda_t)}{g(\lambda_t)} - (1 - \pi_1 - \pi_2)\left(\frac{g'(\lambda_t)}{g(\lambda_t)}\right)^{2}\right]\lambda_t^{2},$$
where the function $g(\lambda_t) = \sum_{i=0}^{K} \lambda_t^{i}/i!$, and $g'(\lambda_t)$ and $g''(\lambda_t)$ are its first and second derivatives, respectively.
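As a numerical check on the conditional mean and variance, the sketch below computes them once through $g$, $g'$, and $g''$ and once by summing directly over the probability mass function. The function names and the parameter values are illustrative assumptions.

```python
import math

def g_and_derivatives(lam, K):
    """g(lam) = sum_{i=0}^{K} lam^i / i! and its first two derivatives."""
    g = sum(lam ** i / math.factorial(i) for i in range(K + 1))
    g1 = sum(lam ** (i - 1) / math.factorial(i - 1) for i in range(1, K + 1))
    g2 = sum(lam ** (i - 2) / math.factorial(i - 2) for i in range(2, K + 1))
    return g, g1, g2

def conditional_moments(lam, pi1, pi2, K):
    """Conditional mean and variance from the closed-form expressions."""
    g, g1, g2 = g_and_derivatives(lam, K)
    p = 1 - pi1 - pi2
    r1, r2 = g1 / g, g2 / g
    mean = pi2 + p * r1 * lam
    var = (pi2 * (1 - pi2) + p * (1 - 2 * pi2) * r1 * lam
           + p * (r2 - p * r1 ** 2) * lam ** 2)
    return mean, var

def moments_from_pmf(lam, pi1, pi2, K):
    """Same moments obtained by summing over the pmf directly."""
    kernel = [lam ** k / math.factorial(k) for k in range(K + 1)]
    g = sum(kernel)
    pmf = [(1 - pi1 - pi2) * w / g for w in kernel]
    pmf[0] += pi1
    pmf[1] += pi2
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum(k * k * q for k, q in enumerate(pmf)) - mean ** 2
    return mean, var

closed = conditional_moments(1.3, 0.05, 0.3, 4)
direct = moments_from_pmf(1.3, 0.05, 0.3, 4)
print(closed, direct)
```

The two routes agree up to floating-point error, which supports the grouping of the $\lambda_t^{2}$ terms in the variance expression.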
To account for the effects of exogenous variables, we extend Equation (3) by incorporating these variables, denoted as $X_{j,t}$. We then define the ZOBPAR model with exogenous variables (ZOBPARX) as:
$$\lambda_t = \alpha_0 + \alpha_1 Y_{t-1} + \beta_1 \lambda_{t-1} + \sum_{j=1}^{S} \gamma_j X_{j,t-1}, \tag{5}$$
where $\gamma_j$ is the parameter of the $j$th exogenous variable, $S$ is the number of exogenous variables, and we restrict $\gamma_j > 0$ to ensure the non-negativity of $\lambda_t$.
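The ZOBPARX intensity recursion can be sketched as a short loop. The function name `zobparx_intensity`, the start-up value `lam_start`, and the toy inputs are hypothetical; covariates are assumed to be non-negative so that the restriction $\gamma_j > 0$ keeps $\lambda_t$ positive.

```python
def zobparx_intensity(y, covariates, a0, a1, b1, gammas, lam_start):
    """Return the intensity path for a ZOBPARX model.

    y          : observed categories, y[0] is the first observation
    covariates : list of S series, each the same length as y
    lam_start  : assumed intensity attached to the first observation
    """
    if any(g <= 0 for g in gammas):
        raise ValueError("each gamma_j must be positive")
    lams = [lam_start]
    for t in range(1, len(y)):
        # Exogenous effect uses the covariates lagged by one period.
        exo = sum(g * x[t - 1] for g, x in zip(gammas, covariates))
        lams.append(a0 + a1 * y[t - 1] + b1 * lams[-1] + exo)
    return lams

lams = zobparx_intensity(
    y=[1, 0, 2],
    covariates=[[0.5, 1.0, 0.0]],
    a0=0.02, a1=0.7, b1=0.2, gammas=[0.3], lam_start=1.0,
)
print(lams)
```

With an empty covariate list the loop reduces to the ZOBPAR recursion in Equation (3).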

3. Bayesian Inference and Forecasting

Let $\theta = (\pi, \alpha)$ be the unknown parameter vector of the ZOBPAR model, where $\pi = (\pi_1, \pi_2)$, and $\alpha = (\alpha_0, \alpha_1, \beta_1)$. Based on Equation (4), the log-likelihood function of $\{Y_1, Y_2, \ldots, Y_n\}$ is:
$$\log L(\theta) = \sum_{t=2}^{n} l_t(\theta),$$
and
$$l_t(\theta) = I\{Y_t = 0\} \log\left(\pi_1 + \frac{1 - \pi_1 - \pi_2}{g(\lambda_t)}\right) + I\{Y_t = 1\} \log\left(\pi_2 + \frac{(1 - \pi_1 - \pi_2)\,\lambda_t}{g(\lambda_t)}\right) + I\{Y_t \ge 2\} \log\left(\frac{(1 - \pi_1 - \pi_2)\,\lambda_t^{Y_t}}{Y_t!\; g(\lambda_t)}\right),$$
where $\lambda_t$ is defined by Equation (3) or (5), and $g(\lambda_t) = \sum_{i=0}^{K} \lambda_t^{i}/i!$.
Using Bayes’ rule, we multiply the likelihood and the priors to give the conditional posterior probability as follows:
$$p(\theta_l \mid Y, X, \theta_{-l}) \propto L(Y \mid X, \theta) \times p(\theta_l),$$
where $Y = (Y_1, \ldots, Y_n)$, and $X = (X_1, \ldots, X_n)$. Here, $\theta_l$ denotes each parameter in $\theta$, $p(\theta_l)$ is the prior density of $\theta_l$, and $\theta_{-l}$ is the vector of all model parameters except for $\theta_l$. We choose noninformative priors for all parameters and restrict the parameters to the admissible region: $p(\pi) = I(A_1)$ and $p(\alpha) = I(A_2)$, where $A_1 = \{\pi_1 \ge 0,\ \pi_2 \ge 0 \text{ and } \pi_1 + \pi_2 < 1\}$ and $A_2 = \{\alpha_0 > 0,\ \alpha_1 > 0,\ \beta_1 \ge 0 \text{ and } \beta_1 + K(1 - \pi_1 - \pi_2)\,\alpha_1/4 < 1\}$.
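The log-likelihood and the flat priors on the admissible regions combine into a log-posterior that is straightforward to code. The sketch below covers the ZOBPAR case only; the function names and the start-up intensity `lam_start` are assumptions made for illustration, since the initial value of $\lambda_t$ is not specified in the text.

```python
import math

def log_likelihood(y, pi1, pi2, a0, a1, b1, K, lam_start=1.0):
    """Sum of l_t(theta) for t = 2, ..., n (lam_start is an assumed value)."""
    p = 1 - pi1 - pi2
    lam, total = lam_start, 0.0
    for t in range(1, len(y)):  # index t here corresponds to time t + 1
        lam = a0 + a1 * y[t - 1] + b1 * lam
        g = sum(lam ** i / math.factorial(i) for i in range(K + 1))
        if y[t] == 0:
            total += math.log(pi1 + p / g)
        elif y[t] == 1:
            total += math.log(pi2 + p * lam / g)
        else:
            total += math.log(p * lam ** y[t] / (math.factorial(y[t]) * g))
    return total

def log_posterior(y, theta, K):
    """Log-likelihood plus the log of the indicator priors I(A1) and I(A2)."""
    pi1, pi2, a0, a1, b1 = theta
    in_a1 = pi1 >= 0 and pi2 >= 0 and pi1 + pi2 < 1
    in_a2 = (a0 > 0 and a1 > 0 and b1 >= 0
             and b1 + K * (1 - pi1 - pi2) * a1 / 4 < 1)
    if not (in_a1 and in_a2):
        return -math.inf  # outside the prior support
    return log_likelihood(y, pi1, pi2, a0, a1, b1, K)

value = log_posterior([1, 2], (0.01, 0.3, 0.02, 0.7, 0.2), K=4)
outside = log_posterior([1, 2], (0.6, 0.5, 0.02, 0.7, 0.2), K=4)
print(value, outside)
```

Returning negative infinity outside the admissible region means any MH proposal that violates the constraints is rejected automatically.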

3.1. MCMC Sampling Schemes

We designed an MCMC sampling scheme to obtain the posterior estimates of θ for the ZOBPAR model using an adaptive Metropolis-Hastings (MH) algorithm by Chen and So [20]. This approach combines the random walk Metropolis (RW-M) algorithm and the independent kernel MH (IK-MH) algorithm, with a total number of iterations N and burn-in iterations M. The steps of the MCMC sampling scheme are as follows.
Step 1:
Give initial values of $\theta^{[0]} = (\pi_1^{[0]}, \pi_2^{[0]}, \alpha_0^{[0]}, \alpha_1^{[0]}, \beta_1^{[0]})$.
Step 2:
At the $i$th iteration, update $\pi^{[i]} = (\pi_1^{[i]}, \pi_2^{[i]})$ by the MH algorithm conditional on $\alpha^{[i-1]}$. If $i \le M$, then the RW-M algorithm is used; otherwise, the IK-MH algorithm is employed.
Step 3:
Similarly, at the $i$th iteration, $\alpha^{[i]} = (\alpha_0^{[i]}, \alpha_1^{[i]}, \beta_1^{[i]})$ is updated by the MH algorithm conditional on $\pi^{[i]}$. If $i \le M$, then the RW-M algorithm is used; otherwise, the IK-MH algorithm is employed.
Step 4:
Repeat Step 2 and Step 3 until the number of iterations equals $N$.
The procedures for the RW-M and IK-MH algorithms for $\alpha$ run as follows.
(i)
The steps of RW-M for $\alpha$ for $i = 1, \ldots, M$ are as follows.
Step 1:
Generate a candidate $\alpha^{*} = \alpha^{[i-1]} + \epsilon$, where $\epsilon \sim N(0, cI)$, $I$ is the identity matrix, and $\alpha^{[i-1]}$ is the value of $\alpha$ at the $(i-1)$th iteration.
Step 2:
Accept $\alpha^{*}$ as $\alpha^{[i]}$ with acceptance probability:
$$Pr = \min\left\{1, \frac{p(\alpha^{*})}{p(\alpha^{[i-1]})}\right\},$$
where $p(\alpha)$ is the conditional posterior density of $\alpha$. Draw $u \sim U(0, 1)$; if $u < Pr$, set $\alpha^{[i]} = \alpha^{*}$; otherwise, set $\alpha^{[i]} = \alpha^{[i-1]}$.
Step 3:
Repeat Step 1 and Step 2 for each MCMC iteration during the burn-in iterations.
Note that the step size $c$ in Step 1 controls the acceptance rate for $\alpha$.
According to Gelman et al. [21], a suitable acceptance rate for good convergence is between 25% and 50%.
(ii)
The steps of IK-MH for $\alpha$ for $i = M + 1, \ldots, N$ are as follows:
Step 1:
Generate a candidate $\alpha^{*} = \mu_{\alpha} + \epsilon$, where $\epsilon \sim N(0, \Omega_{\alpha})$, with $\mu_{\alpha}$ and $\Omega_{\alpha}$ as the sample mean and sample covariance matrix of the draws of $\alpha$ from the burn-in samples.
Step 2:
Accept $\alpha^{*}$ as $\alpha^{[i]}$ with acceptance probability:
$$Pr = \min\left\{1, \frac{p(\alpha^{*})\, g(\alpha^{[i-1]})}{p(\alpha^{[i-1]})\, g(\alpha^{*})}\right\},$$
where $g$ is a Gaussian proposal density with mean $\mu_{\alpha}$ and covariance $\Omega_{\alpha}$. Draw $u \sim U(0, 1)$; if $u < Pr$, set $\alpha^{[i]} = \alpha^{*}$; otherwise, set $\alpha^{[i]} = \alpha^{[i-1]}$.
Step 3:
Repeat Step 1 and Step 2 until the total number of iterations is $N$.
The RW-M and IK-MH procedures for $\pi$ are similar to the procedures for $\alpha$.
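The two phases of the sampler can be sketched for a single parameter block as follows. This is a simplified illustration, not the authors' implementation: the independence proposal uses a diagonal covariance instead of the full matrix $\Omega_{\alpha}$, the target is a toy standard normal log-density, and all function names are hypothetical. Acceptance is evaluated on the log scale to avoid numerical underflow.

```python
import math
import random

def fit_diag_gaussian(draws):
    """Per-coordinate mean and variance of a list of parameter vectors."""
    m, d = len(draws), len(draws[0])
    mu = [sum(r[j] for r in draws) / m for j in range(d)]
    var = [max(sum((r[j] - mu[j]) ** 2 for r in draws) / (m - 1), 1e-12)
           for j in range(d)]
    return mu, var

def log_gauss(v, mu, var):
    """Log density of independent Gaussians (diagonal covariance)."""
    return sum(-0.5 * math.log(2 * math.pi * s) - 0.5 * (x - m) ** 2 / s
               for x, m, s in zip(v, mu, var))

def adaptive_mh(log_post, start, c, n_iter, burn_in, seed=0):
    """RW-M for iterations 1..burn_in, then IK-MH for the remainder."""
    rng = random.Random(seed)
    cur, cur_lp = list(start), log_post(start)
    draws, accepted, mu, var = [], 0, None, None
    for i in range(1, n_iter + 1):
        if i <= burn_in:  # random-walk phase with step size c
            prop = [x + rng.gauss(0.0, math.sqrt(c)) for x in cur]
            prop_lp = log_post(prop)
            log_ratio = prop_lp - cur_lp
        else:  # independence phase, proposal fitted to burn-in draws
            if mu is None:
                mu, var = fit_diag_gaussian(draws)
            prop = [rng.gauss(m, math.sqrt(s)) for m, s in zip(mu, var)]
            prop_lp = log_post(prop)
            log_ratio = (prop_lp - cur_lp) + (log_gauss(cur, mu, var)
                                              - log_gauss(prop, mu, var))
        # u < Pr on the log scale; 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) < log_ratio:
            cur, cur_lp = prop, prop_lp
            accepted += 1
        draws.append(list(cur))
    return draws, accepted / n_iter

def toy_log_post(v):
    return -0.5 * v[0] ** 2  # standard normal, up to a constant

draws, rate = adaptive_mh(toy_log_post, start=[0.0], c=1.0,
                          n_iter=4000, burn_in=1000)
kept = [d[0] for d in draws[1000:]]
print(round(rate, 3), round(sum(kept) / len(kept), 3))
```

In the full scheme this update would be applied in turn to $\pi$ and to $\alpha$ within each iteration, with the model log-posterior in place of the toy target.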

3.2. Bayesian Forecasting

In the empirical illustration, we conduct one-step-ahead forecasting to predict $\hat{Y}_{t+1}$. The Bayesian forecasting procedure runs as follows (for example, with $K = 4$):
Step 1:
Obtain the posterior means of the parameters $\hat{\theta} = (\hat{\pi}, \hat{\alpha})$, where $\hat{\pi} = (\hat{\pi}_1, \hat{\pi}_2)$ and $\hat{\alpha} = (\hat{\alpha}_0, \hat{\alpha}_1, \hat{\beta}_1)$,
$$\hat{\pi}_j = \frac{\sum_{i=M+1}^{N} \pi_j^{[i]}}{N - M}, \quad j = 1, 2, \quad \text{and} \quad \hat{\alpha}_j = \frac{\sum_{i=M+1}^{N} \alpha_j^{[i]}}{N - M}, \quad j = 0, 1, 2 \quad (\text{where } \hat{\alpha}_2 = \hat{\beta}_1),$$
and then substitute $\hat{\alpha} = (\hat{\alpha}_0, \hat{\alpha}_1, \hat{\beta}_1)$ into Equation (3) to calculate $\hat{\lambda}_{t+1} = \hat{\alpha}_0 + \hat{\alpha}_1 Y_t + \hat{\beta}_1 \lambda_t$.
Step 2:
Compute the conditional probability:
$$P(Y_{t+1} = k \mid \theta, \mathcal{F}_t) = \hat{\pi}_1 I\{k = 0\} + \hat{\pi}_2 I\{k = 1\} + (1 - \hat{\pi}_1 - \hat{\pi}_2)\,\frac{\hat{\lambda}_{t+1}^{k}/k!}{\sum_{i=0}^{K} \hat{\lambda}_{t+1}^{i}/i!},$$
for $k = 0, 1, 2, 3, 4$, using $\hat{\pi} = (\hat{\pi}_1, \hat{\pi}_2)$ and $\hat{\lambda}_{t+1}$.
Step 3:
Generate a random number $u \sim U(0, 1)$.
(1)
If $u \le P(Y_{t+1} = 0)$, then $\hat{Y}_{t+1} = 0$; else, move to the next statement.
(2)
If $P(Y_{t+1} = 0) < u \le P(Y_{t+1} \le 1)$, then $\hat{Y}_{t+1} = 1$; else, move to the next statement.
(3)
If $P(Y_{t+1} \le 1) < u \le P(Y_{t+1} \le 2)$, then $\hat{Y}_{t+1} = 2$; else, move to the next statement.
(4)
If $P(Y_{t+1} \le 2) < u \le P(Y_{t+1} \le 3)$, then $\hat{Y}_{t+1} = 3$; else, move to the next statement.
(5)
$\hat{Y}_{t+1} = 4$.
We employ this procedure to generate one-step-ahead forecasts for bounded ordinal time series data.
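Steps 2 and 3 amount to computing the one-step-ahead probabilities and inverting the cumulative distribution at a uniform draw. A minimal sketch follows; the function names and the numerical inputs are illustrative assumptions, with the posterior means taken as given.

```python
import math
import random

def one_step_probs(y_t, lam_t, pi1, pi2, a0, a1, b1, K):
    """Probabilities P(Y_{t+1} = k), k = 0..K, at the plugged-in estimates."""
    lam_next = a0 + a1 * y_t + b1 * lam_t
    kernel = [lam_next ** k / math.factorial(k) for k in range(K + 1)]
    g = sum(kernel)
    probs = [(1 - pi1 - pi2) * w / g for w in kernel]
    probs[0] += pi1
    probs[1] += pi2
    return probs

def category_from_uniform(probs, u):
    """Return the smallest k whose cumulative probability is at least u."""
    cum = 0.0
    for k, p in enumerate(probs):
        cum += p
        if u <= cum:
            return k
    return len(probs) - 1  # guard against floating-point round-off

# Hypothetical posterior means and current state.
probs = one_step_probs(y_t=1, lam_t=0.9, pi1=0.01, pi2=0.3,
                       a0=0.02, a1=0.7, b1=0.2, K=4)
y_hat = category_from_uniform(probs, random.Random(0).random())
print([round(p, 4) for p in probs], y_hat)
```

The loop over cumulative probabilities generalizes the five nested conditions to any upper bound $K$.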

4. Simulation Study

In this section, we conduct a simulation study to examine the established Bayesian MCMC method. There are two models: Model 1 is a ZOBPAR model, and Model 2 is a ZOBPARX model, specified as follows to generate simulated data with sample sizes $n = 500$ and $n = 1000$:
Model 1:
$$\lambda_t = 0.02 + 0.7\, y_{t-1} + 0.2\, \lambda_{t-1},$$
$$P(Y_t = k \mid \mathcal{F}_{t-1}) = 0.01\, I\{k = 0\} + 0.3\, I\{k = 1\} + 0.69\,\frac{\lambda_t^{k}/k!}{\sum_{i=0}^{4} \lambda_t^{i}/i!}, \quad k = 0, 1, \ldots, 4,$$
with $\pi = (\pi_1, \pi_2) = (0.01, 0.3)$ and $\alpha = (\alpha_0, \alpha_1, \beta_1) = (0.02, 0.7, 0.2)$.
Model 2:
$$\lambda_t = 0.02 + 0.7\, y_{t-1} + 0.2\, \lambda_{t-1} + 0.3\, X_{t-1},$$
$$P(Y_t = k \mid \mathcal{F}_{t-1}) = 0.01\, I\{k = 0\} + 0.3\, I\{k = 1\} + 0.69\,\frac{\lambda_t^{k}/k!}{\sum_{i=0}^{4} \lambda_t^{i}/i!}, \quad k = 0, 1, \ldots, 4,$$
with $\pi = (\pi_1, \pi_2) = (0.01, 0.3)$ and $\alpha = (\alpha_0, \alpha_1, \beta_1, \gamma_1) = (0.02, 0.7, 0.2, 0.3)$.
We generate $X_t$ in Model 2 from a Gamma distribution with $X_t \sim G(2, 2)$. Both models follow a bounded Poisson distribution with bound $K = 4$.
Estimating a set of unknown parameters simultaneously can lead to slow convergence of the parameter estimates. We therefore estimate the model parameters with the designed MCMC sampling scheme and employ two mechanisms to speed up convergence. First, we use an adaptive MCMC sampling method, which combines the RW-M algorithm and the IK-MH algorithm, as described in Section 3.1. Second, to reduce autocorrelation, we thin the MCMC output by keeping every fifth draw. The total number of MCMC iterations is 20,000, which includes a burn-in period of 8000 iterations. Using R code, the computational times for the ZOBPAR model with sample sizes $n = 500$ and $n = 1000$ are 159 and 309 s, respectively. For the ZOBPARX model, the CPU times are approximately 241 and 464 s. Parameter estimation is thus completed in under eight minutes, and all computations are performed on a computer equipped with an i7-11700 CPU and 64 GB of RAM.
To examine the convergence of MCMC, we monitor the trace and autocorrelation function (ACF) plots of the MCMC samples after the burn-in period. We provide trace plots and ACF plots for MCMC samples based on simulated data from Model 1 and Model 2 with sample size $n = 1000$ (see Figure 1 and Figure 2). We expect the trace plots of all parameters to vary randomly within a stable range and the ACF plots to show no substantial autocorrelation in the MCMC samples. The plots meet these expectations, which indicates that the MCMC samples have converged and that the parameter estimates are reliable.
The results in Table 1 reveal that the averages of posterior means based on 100 replications are close to the true values, and the 95% Bayesian credible intervals (2.5th and 97.5th percentiles) cover the corresponding true values. This confirms that the designed MCMC method provides accurate parameter estimates. To examine the consistency of parameter estimates, we report the results for sample sizes $n = 500$ and $n = 1000$. The results indicate that the proposed Bayesian method presents accurate parameter estimates, with standard deviations that become small as the sample size increases.
The results of parameter estimates for the ZOBPARX model (Model 2) are presented in Figure 2 and Table 2. All MCMC samples converge, and the posterior means are close to the true values. Again, the posterior standard deviations become small as the sample size increases.
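The burn-in and thinning settings determine how many draws remain for inference. The short sketch below counts them under the assumption that thinning starts with the first post-burn-in iteration; the function name is hypothetical.

```python
def retained_indices(n_iter, burn_in, thin):
    """Zero-based indices of the draws kept after burn-in and thinning."""
    return list(range(burn_in, n_iter, thin))

kept = retained_indices(n_iter=20000, burn_in=8000, thin=5)
print(len(kept))  # 2400
```

With 20,000 iterations, a burn-in of 8000, and every fifth draw retained, 2400 draws are available for computing posterior summaries.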

5. Empirical Application

To demonstrate our proposed method, we investigate daily AQI levels from the weather stations of Pingtung, Miaoli, and Zuoying in Taiwan. We collect daily AQI levels for each station from 30 December 2016 to 31 January 2020 for a total of 1129 observations. To evaluate the forecasting performance, we separate the whole sample period into two periods: the training period with 764 observations for in-sample model estimation and the testing period with 365 days for out-of-sample forecasts. Using a rolling window approach, we conduct one-step-ahead forecasts for 365 days and evaluate the forecast performance by computing the accuracy of AQI level forecasts.
Precipitation can effectively reduce particulate matter concentrations (PM10 and PM2.5) in the air. When it rains, these particles are captured by raindrops and carried to the ground. This process can lead to a significant improvement in AQI, particularly in reducing particulate pollution. We thus treat daily accumulated precipitation (PRE) and the winter dummy variable as exogenous variables and consider different combinations of ZOBPARX models to study the effects of exogenous variables. We define daily accumulated PRE as an exogenous variable $X_t$, which negatively correlates with the AQI level. We propose two distinct transformations to fulfil the coefficient constraint ($\gamma_j > 0$) and to address the negative correlation between yesterday’s PRE and today’s AQI level. The first transformation involves taking the reciprocal of PRE (TF1_PRE), set as $X_{1,t}$, and the second involves computing the exponential of the negative PRE (TF2_PRE), designated as $X_{2,t}$.
$$\text{TF1\_PRE}: \ X_{1,t} = 1/(X_t + 1) \quad \text{and} \quad \text{TF2\_PRE}: \ X_{2,t} = \exp(-X_t).$$
These transformations of $X_t$ ensure positive estimates of the coefficients in the ZOBPARX model. To investigate the effect of seasonality, we consider the winter dummy variable and month dummy variables as exogenous variables. For the winter dummy variable, we define $S_t$ as:
$$S_t = \begin{cases} 1, & \text{from October to March}, \\ 0, & \text{otherwise}. \end{cases}$$
According to the definitions of exogenous variables, we consider the following ZOBPARX models with different combinations of exogenous variables for the three datasets.
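The covariate transformations and the winter dummy are simple to compute. The sketch below uses hypothetical function names; both transformations map non-negative precipitation into the interval $(0, 1]$ and decrease as precipitation increases.

```python
import math
from datetime import date

def tf1_pre(pre):
    """TF1_PRE: reciprocal transformation 1 / (PRE + 1)."""
    return 1.0 / (pre + 1.0)

def tf2_pre(pre):
    """TF2_PRE: exponential of the negative precipitation."""
    return math.exp(-pre)

def winter_dummy(day):
    """S_t = 1 from October to March, 0 otherwise."""
    return 1 if day.month in (10, 11, 12, 1, 2, 3) else 0

print(tf1_pre(9.0), tf2_pre(0.0), winter_dummy(date(2019, 7, 15)))
```

A dry day gives a transformed value of 1 under both definitions, and heavy rain drives the value toward 0, so a positive coefficient corresponds to lower intensity after rainfall.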
  • ZOBPARX 1: (exogenous variable: TF1_PRE)
    $\lambda_t = \alpha_0 + \alpha_1 Y_{t-1} + \beta_1 \lambda_{t-1} + \gamma_1 X_{1,t-1}$.
  • ZOBPARX 2: (exogenous variable: TF2_PRE)
    $\lambda_t = \alpha_0 + \alpha_1 Y_{t-1} + \beta_1 \lambda_{t-1} + \gamma_1 X_{2,t-1}$.
  • ZOBPARX 3: (exogenous variables: TF1_PRE and winter dummy)
    $\lambda_t = \alpha_0 + \alpha_1 Y_{t-1} + \beta_1 \lambda_{t-1} + \gamma_1 X_{1,t-1} + \gamma_2 S_t$.
  • ZOBPARX 4: (exogenous variables: TF2_PRE and winter dummy)
    $\lambda_t = \alpha_0 + \alpha_1 Y_{t-1} + \beta_1 \lambda_{t-1} + \gamma_1 X_{2,t-1} + \gamma_2 S_t$.
In the process of data collection, we faced the problem of missing observations in AQI and the weather covariates. To avoid loss of information, we adopt the k-nearest neighbors (KNN) algorithm by Cover and Hart [22] to impute the missing values, which is the same approach as in Chen and Chiu [15].
Chiu [23] conducted an experiment on a selected period of the AQI time series without missing values from three randomly chosen sites. Chiu [23] then introduced missing values at every 10th datapoint and imputed these using the KNN-imputation method from the R package “DMwR” [24]. This process compares different $h$ values for AQI prediction, using some variables to minimize the mean absolute error (MAE) and root mean squared error (RMSE). The results suggest that selecting the four days ($h = 4$) whose rainfall, temperature, wind direction, PM2.5, and seasonal dummy values are closest to those of the day with missing AQI, and then taking the weighted average of the AQI values from these four days, serves as an effective imputation for the missing AQI. Following the same line, we refer to the results of Chiu [23] and Chen and Chiu [15] and set $h = 4$ for the imputation of missing values in PRE and AQI, as shown below.
(1)
PRE with missing values: If the PRE value is missing at day t, then we pick the valid data of the nearest station to impute;
(2)
AQI with missing values: In the KNN algorithm, we set $h = 4$ to impute the missing AQI values by the corresponding weather data in the nearest four days concerning PRE, daily average temperature, daily wind direction, PM2.5, and seasonal dummy value closest to the day with missing AQI; the imputation of a missing AQI value is the weighted average of the AQI values of these four days. The weights decrease as the distances to their neighbor increase, and we use a Gaussian kernel function to take the weights from their distances. For more details, one can refer to Torgo [24].
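The weighted nearest-neighbor imputation can be sketched as follows. This is not the DMwR implementation: the Euclidean distance on raw features, the kernel form $\exp(-d^{2}/2)$ with unit bandwidth, and the function name are all assumptions made for illustration, and in practice the features would typically be standardized first.

```python
import math

def knn_impute(target_features, donors, h=4):
    """Impute a missing AQI value from the h nearest donor days.

    target_features : feature vector of the day with missing AQI
    donors          : list of (feature_vector, aqi) for days with observed AQI
    """
    # Rank donor days by Euclidean distance to the target day.
    ranked = sorted((math.dist(target_features, f), aqi) for f, aqi in donors)
    nearest = ranked[:h]
    # Assumed Gaussian kernel: weights shrink as the distance grows.
    weights = [math.exp(-0.5 * d ** 2) for d, _ in nearest]
    return sum(w * aqi for w, (_, aqi) in zip(weights, nearest)) / sum(weights)

donors = [
    ([1.0, 0.0], 10.0), ([0.0, 1.0], 20.0),
    ([-1.0, 0.0], 30.0), ([0.0, -1.0], 40.0),
    ([5.0, 5.0], 1000.0),  # far away, excluded when h = 4
]
imputed = knn_impute([0.0, 0.0], donors, h=4)
print(imputed)
```

In this toy example the four nearest donors are equidistant, so the imputed value is their simple average and the distant fifth day has no influence.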
Following the U.S. EPA’s classification of AQI levels, we classify AQI values into four levels, each represented by different colors for visual identification, as in Table 3. Figure 3 presents the time series plots of daily AQI values from 30 December 2016 to 31 January 2020 for Pingtung, Miaoli, and Zuoying. Observing these time series plots, the changes of AQI values in Pingtung and Zuoying are more volatile than in Miaoli, and the AQI values are low from May to September for each year at each station. A large proportion of AQI values under 100 at the three stations distinctly demonstrates the phenomenon of zero-one inflation in the data concerning AQI levels. Figure 4 plots the proportions of AQI levels by stacked bar charts for each month and each station. It is obvious that the levels of AQI are quite different among months, while the period from June to August has better air quality, with large proportions of 0 and 1 levels of AQI.
The time series plots of daily PRE for the three stations, presented as the exogenous variable PRE in Figure 5, indicate that the rainy season occurs periodically from May to September. The highest daily cumulative PRE typically occurs during July and August. We provide the summary statistics of AQI and PRE for the three stations in Table 4. The means and standard deviations of AQI of Pingtung and Zuoying are both larger than their values for Miaoli. This is consistent with the findings in Figure 3. The maximum values of AQI of Pingtung and Zuoying both exceed 200, which is at the level of “poor”. Similar to the PRE values, Pingtung and Zuoying have larger mean values and standard deviations of PRE than Miaoli. We observe that the weather is relatively unstable in Pingtung and Zuoying.
We employ MCMC methods for parameter estimation and one-step-ahead AQI forecasts during the out-of-sample period for each dataset. To evaluate the performance of our proposed method in forecasting AQI levels, we fit a ZOBPAR model and four ZOBPARX models, using each to perform one-step-ahead forecasts for 365 days through the rolling window method. To the best of our knowledge, apart from the ZOBPAR model, no other model can investigate zero-one inflation with a bounded Poisson distribution. Therefore, we consider a ZOBPAR model and four ZOBPARX models for the comparative analysis. Table 5 presents two evaluation metrics—accuracy and penalty—used to assess the AQI level forecasts over a period of 365 days for each dataset, as predicted by these five different models.
Categories 0 and 1 of AQI levels both represent satisfactory air quality and have no effect on human health, whereas other categories are harmful to human health. For calculating the accuracy of AQI-level forecasts, we therefore split all categories into two levels. To evaluate the effectiveness of model prediction, we propose a penalty mechanism that reflects a numerical ‘cost’ or ‘penalty’ for each type of misclassification in our categorical data. The penalty scores in Table 5 are the sums of daily penalties during the forecasting period. The lower the penalty score, the better the model.
The penalty mechanism adopts a concept akin to weighted absolute error. This approach involves defining a weight or cost for each type of error (e.g., predicting Category A when the actual category is B) and then calculating an overall score based on these weights. When the difference between the actual and predicted AQI levels, $Y_i - \hat{Y}_i$, is −1, −2, 1, or 2, the penalty mechanism assigns the corresponding weight $w_i$ of 1, 2, 4, or 8, respectively, in the weighted absolute error $w_i |Y_i - \hat{Y}_i|$. The aim of this method is to ensure that the weights accurately reflect the relative importance of each category, with the underestimated category receiving a higher penalty than the overestimated one, and to accurately represent the cost of each type of error in our specific context.
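The two evaluation metrics can be sketched in a few lines. Two readings are assumed here and should be checked against Table 5: the daily penalty is taken as the weight multiplied by the absolute difference, and accuracy is computed after collapsing the categories into satisfactory (0 or 1) versus harmful (2 or above). The function names and the toy data are hypothetical.

```python
# Weights indexed by the signed difference (actual minus predicted).
WEIGHTS = {-1: 1, -2: 2, 1: 4, 2: 8}

def total_penalty(actual, predicted):
    """Sum of daily penalties w * |actual - predicted| (assumed reading)."""
    total = 0
    for y, y_hat in zip(actual, predicted):
        diff = y - y_hat
        if diff == 0:
            continue  # correct forecast, no penalty
        if diff not in WEIGHTS:
            raise ValueError("no weight is defined for difference %d" % diff)
        total += WEIGHTS[diff] * abs(diff)
    return total

def two_level_accuracy(actual, predicted):
    """Share of days on which the collapsed level (0/1 vs 2+) is correct."""
    hits = sum((y >= 2) == (y_hat >= 2) for y, y_hat in zip(actual, predicted))
    return hits / len(actual)

actual = [0, 1, 2, 3, 1]
predicted = [0, 2, 1, 1, 1]
print(total_penalty(actual, predicted), two_level_accuracy(actual, predicted))
```

Positive differences, where the forecast underestimates the actual level, carry the larger weights, which matches the stated aim of penalizing underestimation more heavily.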
For the Pingtung site, the ZOBPARX 1 model shows the highest accuracy (67.4%), indicating it most frequently predicts the AQI levels correctly. ZOBPARX 1 also leads with a lower penalty score than the other models. For the Miaoli site, the ZOBPARX 3 model has the highest accuracy (87.9%) and the lowest penalty. For the Zuoying site, the ZOBPARX 3 model shows the highest accuracy (69.9%), but the ZOBPAR model provides the lowest penalty. The ZOBPARX 2 model performs as the second-best model, providing reasonable accuracy and a low penalty.
It appears that the models generally perform well in forecasting for Miaoli’s air quality. In contrast, these models exhibit lower accuracy in Pingtung and Zuoying. In summary, different models excel in different locations, highlighting the importance of selecting location-specific models for accurate AQI forecasting.
The Environmental Protection Bureau of the Pingtung County government states that the county’s geographic location at the southernmost end, often positioned downwind of prevailing winds, contributes to its air pollution issues. The weaker wind speeds in winter, fewer days of rainfall, and poor atmospheric dispersion conditions, compounded by the region’s topography that can create localized eddy currents, are all factors that lead to higher concentrations of air pollutants in Pingtung County during the autumn and winter seasons, often exceeding standard levels [25]. We need to tailor models for local geographical and environmental characteristics to improve forecast accuracy for specific regions like Pingtung.
For the three datasets, we present the posterior estimates of parameters for the best performing models in Table 6 and also provide convergence diagnostic checking, the convergence diagnostic (CD) test, and inefficiency factors (Ineff.) to demonstrate converged parameter estimates. The CD test introduced by Geweke [26] has a p-value greater than 0.05, and the Ineff. of Chib [27] has a small value that is far less than MCMC iterations; both reveal that the chain of MCMC samples converges. The last two columns of Table 6 present the p-values of CD tests and the Ineff. values. All p-values of CD tests are greater than 0.05, and all Ineff. values are far smaller than MCMC iterations. This means that all parameter estimates converge and are reliable for making inferences.
To check the adequacy of the fitted model, we compute standardized Pearson’s residuals by Jung et al. [28] as follows:
$$Z_t = \frac{y_t - E(Y_t \mid \mathcal{F}_{t-1})}{\sqrt{Var(Y_t \mid \mathcal{F}_{t-1})}}, \quad t = 1, \ldots, n.$$
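Given the conditional means and variances from Section 2, the residuals follow directly. The sketch below takes those moments as inputs; the function name and the toy numbers are illustrative assumptions.

```python
import math

def pearson_residuals(y, cond_means, cond_vars):
    """Standardized Pearson residuals (y_t - mean_t) / sqrt(var_t)."""
    return [(obs - m) / math.sqrt(v)
            for obs, m, v in zip(y, cond_means, cond_vars)]

residuals = pearson_residuals([1, 2], [1.0, 1.0], [4.0, 0.25])
print(residuals)  # [0.0, 2.0]
```

For an adequate model, these residuals should have a mean near zero, a variance near one, and no significant autocorrelation.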
For Zuoying, in Table 6 we observe that the posterior estimate of $\alpha_1$ in the ZOBPARX 2 model is larger than those of $\alpha_0$ and $\beta_1$. This indicates that the AQI level of the previous day has a significant effect on the current day’s mean AQI level. The estimates of the probabilities $\pi_1$ and $\pi_2$ show that $\pi_2$ is larger than $\pi_1$ in the case of Zuoying. Focusing on the parameters of exogenous variables, the parameter $\gamma_1$ for $X_{2,t}$ in the ZOBPARX 2 model for Pingtung has an estimate of 0.0457.
For Pingtung, the ZOBPARX 1 model produces more accurate predictions than the competing models. Bayesian parameter estimates for the ZOBPARX 1 model are presented in Table 6. We also observe that the posterior estimate of $\alpha_1$ has a larger magnitude than those of $\alpha_0$ and $\beta_1$, and again the estimate of $\pi_2$ is greater than that of $\pi_1$ at this site.
For Miaoli, the ZOBPARX 3 model is the best-performing model with the highest accuracy among the competing models. The parameter estimation results for the ZOBPARX 3 model are presented in Table 6. The posterior estimate of $\alpha_1$ is again larger than those of both $\alpha_0$ and $\beta_1$, and the estimates of $\pi_1$ and $\pi_2$ reveal that $\pi_2$ is larger than $\pi_1$ in the Miaoli dataset. When examining the parameters of exogenous variables, the parameters $\gamma_1$ and $\gamma_2$ for $X_{1,t}$ and $S_t$ in the ZOBPARX 3 model for Miaoli show a much smaller effect on AQI levels, with estimates of 0.0200 and 0.0478, respectively.
These findings underscore the significance of $\alpha_1$ in AQI forecasting models for each site, emphasizing the crucial influence of the previous day’s AQI levels. Consistently across the three sites, the estimates satisfy $\pi_2 > \pi_1$. Furthermore, PRE or the winter dummy variable plays an important role in the forecasts.
Figure 6 displays the time plots and ACF plots of the standardized Pearson’s residuals for the three sites, derived using the most accurate forecasting models. The diagnostic checking results suggest that the proposed models adequately capture the changes in AQI levels at these sites.
To gain a detailed understanding of the AQI level forecasts for each dataset, we compute the proportions of forecasted AQI levels ($\hat{Y}_t$) and compare them with the actual proportions of AQI levels observed during the 365-day out-of-sample forecasting period, as shown in Table 7. For Miaoli, the best model identified in Table 5 yields accurate AQI level forecasts, with the proportions of forecasted levels ($\hat{Y}_t$) close to the true proportions. For Pingtung, owing to local geographical and environmental conditions, even the best forecasting model does not perform well in terms of the proportions of forecasted AQI levels.
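The proportion comparison can be sketched in a few lines of Python. The helper names below are hypothetical; the example counts are the Pingtung rows of Table 7 (true levels and ZOBPARX 1 forecasts), expanded into sequences of levels only to exercise the functions.

```python
from collections import Counter


def level_percentages(levels, n_levels=4):
    """Percentage of days at each AQI level 0..n_levels-1, rounded to one decimal."""
    counts = Counter(levels)
    n = len(levels)
    return [round(100.0 * counts.get(k, 0) / n, 1) for k in range(n_levels)]


def largest_gap(forecast_pct, true_pct):
    """Largest absolute difference between forecast and true percentages."""
    return max(abs(f - t) for f, t in zip(forecast_pct, true_pct))


# Pingtung, 365-day test period (counts taken from Table 7).
true_levels = [0] * 98 + [1] * 164 + [2] * 82 + [3] * 21
zobparx1_levels = [0] * 140 + [1] * 117 + [2] * 60 + [3] * 48

true_pct = level_percentages(true_levels)
forecast_pct = level_percentages(zobparx1_levels)
gap = largest_gap(forecast_pct, true_pct)
```

A small gap indicates that the forecasts reproduce the marginal distribution of levels. It does not by itself show that individual days are forecast correctly, which is what the accuracy percentages in Table 5 measure.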

6. Conclusions

This paper addresses the modeling and forecasting of bounded ordinal time series data, with a special focus on AQI levels. The data are bounded and concentrated on a few categories, such as states 0 and 1, which occur with high probabilities. We employ ZOBPAR models, both with and without exogenous variables, to capture the dynamic changes in such ordinal time series data. We propose a Bayesian inference method based on an effective MCMC sampling scheme. This method estimates the model parameters and forecasts one-step-ahead AQI levels using a rolling window approach. To check the convergence of the MCMC samplers, we visually inspect the trace and ACF plots and compute Geweke's convergence diagnostic.
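The rolling window scheme can be sketched generically as follows. This is a minimal illustration that assumes a fixed-length window. The callable `fit_and_forecast` is a hypothetical stand-in for refitting the model by MCMC and returning a one-step-ahead forecast, and `last_value_forecaster` is only a placeholder used to make the sketch runnable.

```python
def rolling_one_step_forecasts(series, window, fit_and_forecast):
    """One-step-ahead forecasts from a fixed-length rolling window.

    For each forecast origin, the model is refitted on the most recent
    `window` observations and asked to predict the next value.
    """
    forecasts = []
    for end in range(window, len(series)):
        train = series[end - window:end]  # observations strictly before time `end`
        forecasts.append(fit_and_forecast(train))
    return forecasts


def last_value_forecaster(train):
    """Placeholder model (assumption): predict the most recent observed level."""
    return train[-1]


levels = [0, 1, 1, 2, 1, 0]
forecasts = rolling_one_step_forecasts(levels, window=3, fit_and_forecast=last_value_forecaster)
```

Each forecast uses only data available before the day being predicted, so the forecasts can be compared with the held-out observations `levels[window:]`.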
Simulation studies demonstrate that the proposed method provides reliable model parameter estimates, and the posterior means of these estimates are robustly close to the actual values as the sample size increases. For the empirical study, we investigate three datasets of daily AQI levels from the Pingtung, Zuoying, and Miaoli stations in Taiwan. Apart from Pingtung, the prediction outcomes demonstrate that the proposed method effectively forecasts AQI levels during the testing period. This is evident from the alignment of the predicted AQI proportions with the actual proportions, which is especially notable at the Miaoli station. To enhance forecasting accuracy for Pingtung, models need to be tailored to its unique geographical and environmental characteristics.
Aside from parametric models, machine learning methods, including tree-based models, neural networks, and support vector machines, can also be used to forecast bounded ordinal time series data. We suggest considering model-averaging forecasts for ordinal and bounded time series data. Averaging over multiple models reduces the risk of relying on a single, potentially inappropriate model and can also enhance accuracy by balancing out the biases inherent in individual models. We plan to explore this aspect in future work.
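One simple form that such model averaging could take is sketched below, purely as an assumption about how it might look; the paper does not specify a scheme. Each model is assumed to supply a vector of predictive probabilities over the AQI levels 0 to 3, the vectors are combined with (optionally unequal) weights, and the forecast is the level with the highest averaged probability. The function name and weights are hypothetical.

```python
def average_forecast(prob_vectors, weights=None):
    """Weighted average of predictive probability vectors and the modal level.

    prob_vectors: one probability vector per model, all of the same length.
    weights: optional non-negative model weights (equal weights by default).
    """
    m = len(prob_vectors)
    if weights is None:
        weights = [1.0] * m
    total = float(sum(weights))
    k = len(prob_vectors[0])
    avg = [sum(w * p[j] for w, p in zip(weights, prob_vectors)) / total for j in range(k)]
    level = max(range(k), key=lambda j: avg[j])  # ties resolve to the lowest level
    return avg, level


model_a = [0.6, 0.3, 0.1, 0.0]
model_b = [0.1, 0.6, 0.2, 0.1]
avg, level = average_forecast([model_a, model_b])
```

With equal weights, the two hypothetical models above disagree on the most likely level, and the average settles on level 1. Weighting by past forecast performance would be a natural refinement.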

Author Contributions

Conceptualization, C.W.S.C.; Methodology, F.-C.L. and C.W.S.C.; Software, C.-Y.H.; Formal analysis, F.-C.L., C.W.S.C. and C.-Y.H.; Investigation, F.-C.L. and C.W.S.C.; Data curation, C.-Y.H.; Writing—original draft, F.-C.L., C.W.S.C. and C.-Y.H.; Writing—review & editing, F.-C.L. and C.W.S.C.; Visualization, C.W.S.C. and C.-Y.H.; Supervision, F.-C.L. and C.W.S.C.; Project administration, F.-C.L.; Funding acquisition, F.-C.L. and C.W.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, grant numbers NSTC 112-2118-M-035-002- and NSTC 112-2118-M-035-001-MY3.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zeger, S.L. A regression model for time series of counts. Biometrika 1988, 75, 621–629.
  2. Zhu, F.; Li, Q.; Wang, D. A mixture integer-valued ARCH model. J. Stat. Plan. Inference 2010, 140, 2025–2036.
  3. Ferland, R.; Latour, A.; Oraichi, D. Integer-valued GARCH process. J. Time Ser. Anal. 2006, 27, 923–942.
  4. Weiß, C.H. Modelling time series of counts with overdispersion. Stat. Methods Appl. 2009, 18, 507–519.
  5. Fokianos, K.; Rahbek, A.; Tjøstheim, D. Poisson autoregression. J. Am. Stat. Assoc. 2009, 104, 1430–1439.
  6. Zhu, F. Modeling overdispersed or underdispersed count data with generalized Poisson integer-valued GARCH models. J. Math. Anal. Appl. 2012, 389, 58–71.
  7. Chen, C.W.S.; Lee, S. Generalized Poisson autoregressive models for time series of counts. Comput. Stat. Data Anal. 2016, 99, 51–67.
  8. Chen, C.W.S.; Lee, S. Bayesian causality test for integer-valued time series models with applications to climate and crime data. J. R. Stat. Soc. C Appl. Stat. 2017, 66, 797–814.
  9. Xu, H.Y.; Xie, M.; Goh, T.N.; Fu, X. A model for integer-valued time series with conditional overdispersion. Comput. Stat. Data Anal. 2012, 56, 4229–4242.
  10. Chen, C.W.S.; Khamthong, K. Bayesian modelling of nonlinear negative binomial integer-valued GARCHX models. Stat. Model. 2020, 20, 537–561.
  11. Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14.
  12. Wang, P. Markov zero-inflated Poisson regression models for a time series of counts with excess zeros. J. Appl. Stat. 2001, 28, 623–632.
  13. Yau, K.K.W.; Lee, A.H.; Carrivick, P.J.W. Modeling zero-inflated count series with application to occupational health. Comput. Methods Programs Biomed. 2004, 74, 47–52.
  14. Jazi, M.A.; Jones, G.; Lai, C.D. First-order integer valued AR processes with zero inflated Poisson innovations. J. Time Ser. Anal. 2012, 33, 954–963.
  15. Chen, C.W.S.; Chiu, L.M. Ordinal time series forecasting of the air quality index. Entropy 2021, 23, 1167.
  16. Liu, M.; Zhu, F.; Zhu, K. Modeling normalcy-dominant ordinal time series: An application to air quality level. J. Time Ser. Anal. 2022, 43, 460–478.
  17. Weiß, C.H.; Jahn, M. Soft-clipping INGARCH models for time series of bounded counts. Stat. Model. 2022, 0.
  18. Liu, M.; Zhu, F.; Li, J.; Sun, C. A Systematic Review of INGARCH Models for Integer-Valued Time Series. Entropy 2023, 25, 922.
  19. Weiß, C.H. An Introduction to Discrete-Valued Time Series; Wiley: Hoboken, NJ, USA, 2018.
  20. Chen, C.W.S.; So, M.K.P. On a threshold heteroscedastic model. Int. J. Forecast. 2006, 22, 73–89.
  21. Gelman, A.; Roberts, G.O.; Gilks, W.R. Efficient Metropolis jumping rules. Bayesian Stat. 1996, 5, 599–607.
  22. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  23. Chiu, L.M. Air Quality Forecasting in Taiwan Based on Support Vector Machine and Statistical Models. Master’s Thesis, Feng Chia University, Taichung, Taiwan, 2020. Available online: https://hdl.handle.net/11296/ttra2z (accessed on 9 December 2023).
  24. Torgo, L. Data Mining with R: Learning with Case Studies; Chapman and Hall/CRC, 2010. Available online: http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR (accessed on 20 September 2022).
  25. Pingtung County Government. Causes and Sources of Pollution in Seasons with Poor Air Quality (22 April 2019, in Chinese). Available online: https://www.pthg.gov.tw/plantou/News_Content.aspx?n=B666B8BE5F183769&sms=6B402F30807E7BB3&s=34B0170139EA89FD (accessed on 9 December 2023).
  26. Geweke, J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments (with discussion). In Bayesian Statistics 4; Berger, J.O., Bernardo, J.M., Dawid, A.P., Smith, A.F.M., Eds.; Oxford University Press: Oxford, UK, 1992; pp. 169–193.
  27. Chib, S. Markov Chain Monte Carlo methods: Computation and inference. Handb. Econom. 2001, 5, 3569–3649.
  28. Jung, R.C.; Kukuk, M.; Liesenfeld, R. Time series of count data: Modeling, estimation and diagnostics. Comput. Stat. Data Anal. 2006, 51, 2350–2364.
Figure 1. Trace plots and ACF plots of parameter estimates related to the ZOBPAR model (Model 1).
Figure 2. Trace plots and ACF plots of parameter estimates related to the ZOBPARX model (Model 2).
Figure 3. Time series plots of daily AQI values for Pingtung, Miaoli, and Zuoying from 30 December 2016 to 31 January 2020.
Figure 4. Monthly AQI levels for Pingtung, Miaoli, and Zuoying from 30 December 2016 to 31 January 2020.
Figure 5. Time series plots of daily PRE values for Pingtung, Miaoli, and Zuoying from 30 December 2016 to 31 January 2020.
Figure 6. Time plots and ACF plots of standardized Pearson’s residuals for: (a) Zuoying, (b) Pingtung, and (c) Miaoli.
Table 1. Parameter estimates of the ZOBPAR model (Model 1) based on 100 replications.

| Parameter | True | Mean | Median | Std | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
| n = 500 | | | | | | |
| $\pi_1$ | 0.01 | 0.0559 | 0.0497 | 0.0379 | 0.0035 | 0.1437 |
| $\pi_2$ | 0.30 | 0.3024 | 0.3029 | 0.0334 | 0.2359 | 0.3667 |
| $\alpha_0$ | 0.02 | 0.0573 | 0.0498 | 0.0400 | 0.0039 | 0.1524 |
| $\alpha_1$ | 0.70 | 0.7450 | 0.7378 | 0.1047 | 0.5596 | 0.9759 |
| $\beta_1$ | 0.20 | 0.1670 | 0.1634 | 0.0712 | 0.0413 | 0.3153 |
| n = 1000 | | | | | | |
| $\pi_1$ | 0.01 | 0.0404 | 0.0360 | 0.0270 | 0.0026 | 0.1032 |
| $\pi_2$ | 0.30 | 0.2992 | 0.2995 | 0.0242 | 0.2512 | 0.3460 |
| $\alpha_0$ | 0.02 | 0.0434 | 0.0384 | 0.0286 | 0.0043 | 0.1111 |
| $\alpha_1$ | 0.70 | 0.7427 | 0.7389 | 0.0732 | 0.6106 | 0.8998 |
| $\beta_1$ | 0.20 | 0.1747 | 0.1737 | 0.0517 | 0.0774 | 0.2786 |
Table 2. Parameter estimates of the ZOBPARX model (Model 2) based on 100 replications.

| Parameter | True | Mean | Median | Std | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
| n = 500 | | | | | | |
| $\pi_1$ | 0.01 | 0.0382 | 0.0354 | 0.0224 | 0.0044 | 0.0881 |
| $\pi_2$ | 0.30 | 0.3143 | 0.3143 | 0.0351 | 0.2454 | 0.3829 |
| $\alpha_0$ | 0.02 | 0.1310 | 0.1179 | 0.0860 | 0.0091 | 0.3286 |
| $\alpha_1$ | 0.70 | 0.7453 | 0.7417 | 0.1126 | 0.5351 | 0.9773 |
| $\beta_1$ | 0.20 | 0.1505 | 0.1462 | 0.0728 | 0.0258 | 0.3029 |
| $\gamma_1$ | 0.30 | 0.3073 | 0.2985 | 0.1109 | 0.1148 | 0.5474 |
| n = 1000 | | | | | | |
| $\pi_1$ | 0.01 | 0.0266 | 0.0247 | 0.0154 | 0.003 | 0.0608 |
| $\pi_2$ | 0.30 | 0.3112 | 0.3112 | 0.0250 | 0.2622 | 0.3602 |
| $\alpha_0$ | 0.02 | 0.0973 | 0.0885 | 0.0621 | 0.0069 | 0.2379 |
| $\alpha_1$ | 0.70 | 0.7397 | 0.7380 | 0.0775 | 0.5926 | 0.8965 |
| $\beta_1$ | 0.20 | 0.1508 | 0.1495 | 0.0563 | 0.0458 | 0.2643 |
| $\gamma_1$ | 0.30 | 0.2965 | 0.2928 | 0.0734 | 0.1627 | 0.4496 |
Table 3. Classification of AQI levels.

| Category | Value | Color | AQI Levels |
|---|---|---|---|
| 0 | 0–50 | Green | Good |
| 1 | 51–100 | Yellow | Satisfactory |
| 2 | 101–150 | Orange | Moderately |
| 3 | 151 or more | Red | Poor |
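The mapping in Table 3 from a daily AQI value to an ordinal category can be written as a short Python function. The function name is hypothetical. The table lists integer ranges only, so sending non-integer values between two ranges (for example 50.5) to the higher category is an assumption of this sketch.

```python
def aqi_level(aqi):
    """Map a daily AQI value to the ordinal category of Table 3 (0, 1, 2, or 3)."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    if aqi <= 50:
        return 0  # Green
    if aqi <= 100:
        return 1  # Yellow
    if aqi <= 150:
        return 2  # Orange
    return 3      # Red: 151 or more


levels = [aqi_level(v) for v in (15, 50, 51, 100, 101, 150, 151, 206)]
```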
Table 4. Summary statistics of daily AQI and PRE values.

| Station | Mean | Std | Min | Max |
|---|---|---|---|---|
| AQI | | | | |
| Pingtung | 83.9 | 39.6 | 15.0 | 206.0 |
| Miaoli | 63.9 | 25.5 | 14.0 | 172.0 |
| Zuoying | 83.2 | 42.2 | 17.0 | 210.0 |
| PRE | | | | |
| Pingtung | 6.7 | 24.6 | 0.0 | 356.5 |
| Miaoli | 3.8 | 12.4 | 0.0 | 151.0 |
| Zuoying | 5.1 | 20.7 | 0.0 | 283.5 |
Table 5. Accuracy percentages and penalty scores of AQI level forecasts for Pingtung, Miaoli, and Zuoying from the considered models.

| Model | Pingtung Accuracy (%) | Pingtung Penalty | Miaoli Accuracy (%) | Miaoli Penalty | Zuoying Accuracy (%) | Zuoying Penalty |
|---|---|---|---|---|---|---|
| ZOBPAR | 61.6 | 398 | 85.2 | 123 | 65.8 | 366 |
| ZOBPARX 1 | 67.4 | 377 | 84.9 | 135 | 68.5 | 395 |
| ZOBPARX 2 | 67.1 | 416 | 86.8 | 123 | 69.3 | 367 |
| ZOBPARX 3 | 65.5 | 411 | 87.9 | 111 | 69.9 | 391 |
| ZOBPARX 4 | 64.4 | 448 | 85.8 | 119 | 68.8 | 388 |
Table 6. Parameter estimation of the three sites based on the best forecasting models.

| Parameter | Mean | Median | Std | 2.5% | 97.5% | CD (a) | Ineff. (b) |
|---|---|---|---|---|---|---|---|
| Zuoying | | | | | | | |
| $\pi_1$ | 0.0019 | 0.0014 | 0.0018 | 0.0000 | 0.0066 | 0.4910 | 3.2333 |
| $\pi_2$ | 0.0871 | 0.0879 | 0.0208 | 0.0439 | 0.1265 | 0.1386 | 5.1295 |
| $\alpha_0$ | 0.0325 | 0.0305 | 0.0156 | 0.0073 | 0.0672 | 0.0982 | 2.9496 |
| $\alpha_1$ | 0.6316 | 0.6312 | 0.0690 | 0.4951 | 0.7679 | 0.2377 | 3.4388 |
| $\beta_1$ | 0.3787 | 0.3767 | 0.0645 | 0.2542 | 0.5052 | 0.4152 | 3.2421 |
| $\gamma_1$ | 0.0459 | 0.0421 | 0.0285 | 0.0032 | 0.1095 | 0.8882 | 3.1974 |
| Pingtung | | | | | | | |
| $\pi_1$ | 0.0018 | 0.0013 | 0.0017 | 0.0001 | 0.0065 | 0.7178 | 3.3460 |
| $\pi_2$ | 0.1114 | 0.1115 | 0.0216 | 0.0705 | 0.1553 | 0.5319 | 2.9995 |
| $\alpha_0$ | 0.0768 | 0.0749 | 0.0273 | 0.0303 | 0.1368 | 0.5904 | 3.3300 |
| $\alpha_1$ | 0.5821 | 0.5825 | 0.0855 | 0.4211 | 0.7495 | 0.3520 | 3.5807 |
| $\beta_1$ | 0.4145 | 0.4130 | 0.0881 | 0.2471 | 0.5837 | 0.5017 | 3.5029 |
| $\gamma_1$ | 0.0073 | 0.0054 | 0.0066 | 0.0002 | 0.0243 | 0.0759 | 3.5047 |
| Miaoli | | | | | | | |
| $\pi_1$ | 0.0059 | 0.0041 | 0.0060 | 0.0002 | 0.0223 | 0.4763 | 6.5837 |
| $\pi_2$ | 0.3946 | 0.3942 | 0.0215 | 0.3529 | 0.4357 | 0.0630 | 3.5333 |
| $\alpha_0$ | 0.0283 | 0.0252 | 0.0202 | 0.0014 | 0.0779 | 0.5110 | 3.7953 |
| $\alpha_1$ | 0.6972 | 0.6955 | 0.0625 | 0.5805 | 0.8228 | 0.8103 | 3.4394 |
| $\beta_1$ | 0.1312 | 0.1284 | 0.0667 | 0.0117 | 0.2633 | 0.1030 | 3.4457 |
| $\gamma_1$ | 0.0200 | 0.0155 | 0.0166 | 0.0008 | 0.0613 | 0.1187 | 3.7250 |
| $\gamma_2$ | 0.0479 | 0.0420 | 0.0338 | 0.0033 | 0.1282 | 0.2911 | 3.4330 |

(a) CD: p-values of the convergence diagnostic test. (b) Ineff.: inefficiency factors.
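To make the CD and Ineff. columns concrete, the following Python sketch shows one common way to compute a Geweke-type p-value and an inefficiency factor from a single MCMC chain. It is a simplified illustration under stated assumptions, not the authors' implementation: the long-run variance uses a Bartlett-window estimate, the segment fractions (first 10% and last 50% of the chain) and the truncation lag of 20 are conventional defaults chosen here, and all function names are hypothetical.

```python
import math


def autocov(x, k):
    """Lag-k sample autocovariance (divisor n)."""
    n = len(x)
    m = sum(x) / n
    return sum((x[t] - m) * (x[t - k] - m) for t in range(k, n)) / n


def long_run_variance(x, max_lag):
    """Bartlett-window estimate of the long-run variance (assumed estimator)."""
    lag = min(max_lag, len(x) - 1)
    s = autocov(x, 0)
    for k in range(1, lag + 1):
        s += 2.0 * (1.0 - k / (lag + 1)) * autocov(x, k)
    return s


def geweke_p_value(chain, first=0.1, last=0.5, max_lag=20):
    """Two-sided p-value comparing the means of an early and a late segment."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[n - int(last * n):]
    if len(a) < 2 or len(b) < 2:
        raise ValueError("chain is too short for the requested segments")
    var = long_run_variance(a, max_lag) / len(a) + long_run_variance(b, max_lag) / len(b)
    z = (sum(a) / len(a) - sum(b) / len(b)) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


def inefficiency_factor(chain, max_lag=20):
    """1 + 2 * (sum of autocorrelations up to max_lag)."""
    c0 = autocov(chain, 0)
    return 1.0 + 2.0 * sum(autocov(chain, k) / c0 for k in range(1, max_lag + 1))
```

Under this convention, a large p-value gives no evidence against convergence, and an inefficiency factor close to 1 indicates draws that behave almost like independent samples.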
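Table 7. Number of days and proportions of forecasted AQI levels and true AQI levels for Pingtung, Miaoli, and Zuoying. Columns give AQI levels 0 to 3 for each station.

| Fitted model | Measure | Pingtung 0 | Pingtung 1 | Pingtung 2 | Pingtung 3 | Miaoli 0 | Miaoli 1 | Miaoli 2 | Miaoli 3 | Zuoying 0 | Zuoying 1 | Zuoying 2 | Zuoying 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True level ($Y_t$) | # of days | 98 | 164 | 82 | 21 | 152 | 191 | 21 | 1 | 125 | 134 | 84 | 22 |
| True level ($Y_t$) | Percentage (%) | 26.8 | 44.9 | 22.5 | 5.8 | 41.6 | 52.3 | 5.8 | 0.3 | 34.2 | 36.7 | 23 | 6 |
| ZOBPAR ($\hat{Y}_t$) | # of days | 124 | 131 | 68 | 42 | 124 | 205 | 31 | 5 | 149 | 116 | 51 | 49 |
| ZOBPAR ($\hat{Y}_t$) | Percentage (%) | 34.0 | 35.9 | 18.6 | 11.5 | 34.0 | 56.2 | 8.5 | 1.4 | 40.8 | 31.8 | 14.0 | 13.4 |
| ZOBPARX 1 ($\hat{Y}_t$) | # of days | 140 | 117 | 60 | 48 | 133 | 195 | 21 | 16 | 130 | 132 | 53 | 50 |
| ZOBPARX 1 ($\hat{Y}_t$) | Percentage (%) | 38.4 | 32.1 | 16.4 | 13.2 | 36.4 | 53.4 | 5.8 | 4.4 | 35.6 | 36.2 | 14.5 | 13.7 |
| ZOBPARX 2 ($\hat{Y}_t$) | # of days | 127 | 139 | 58 | 41 | 140 | 195 | 19 | 11 | 138 | 119 | 56 | 52 |
| ZOBPARX 2 ($\hat{Y}_t$) | Percentage (%) | 34.8 | 38.1 | 15.9 | 11.2 | 38.4 | 53.4 | 5.2 | 3.0 | 37.8 | 32.6 | 15.3 | 14.2 |
| ZOBPARX 3 ($\hat{Y}_t$) | # of days | 132 | 120 | 56 | 57 | 114 | 219 | 20 | 12 | 129 | 132 | 59 | 45 |
| ZOBPARX 3 ($\hat{Y}_t$) | Percentage (%) | 36.2 | 32.9 | 15.3 | 15.6 | 31.2 | 60.0 | 5.5 | 3.3 | 35.3 | 36.2 | 16.2 | 12.3 |
| ZOBPARX 4 ($\hat{Y}_t$) | # of days | 123 | 133 | 61 | 48 | 130 | 197 | 29 | 9 | 138 | 127 | 59 | 41 |
| ZOBPARX 4 ($\hat{Y}_t$) | Percentage (%) | 33.7 | 36.4 | 16.7 | 13.2 | 35.6 | 54.0 | 7.9 | 2.5 | 37.8 | 34.8 | 16.2 | 11.2 |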

Share and Cite

MDPI and ACS Style

Liu, F.-C.; Chen, C.W.S.; Ho, C.-Y. Bayesian Forecasting of Bounded Poisson Distributed Time Series. Entropy 2024, 26, 16. https://doi.org/10.3390/e26010016

