Abstract
This research models and forecasts bounded ordinal time series data that can appear in various contexts, such as air quality index (AQI) levels, economic situations, and credit ratings. This class of time series data is characterized by being bounded and exhibiting a concentration of large probabilities on a few categories, such as states 0 and 1. We propose using Bayesian methods for modeling and forecasting in zero-one-inflated bounded Poisson autoregressive (ZOBPAR) models, which are specifically designed to capture the dynamic changes in such ordinal time series data. We innovatively extend models to incorporate exogenous variables, marking a new direction in Bayesian inferences and forecasting. Simulation studies demonstrate that the proposed methods accurately estimate all unknown parameters, and the posterior means of parameter estimates are robustly close to the actual values as the sample size increases. In the empirical study we investigate three datasets of daily AQI levels from three stations in Taiwan and consider five competing models for the real examples. The results exhibit that the proposed method reasonably predicts the AQI levels in the testing period, especially for the Miaoli station.
1. Introduction
Time series of counts are non-negative integer chronological data that are widely investigated in various fields, such as epidemiology, economics, meteorology, and crime. These applications follow Zeger [1] who presented a log-linear regression model to analyze the time series of polio cases in the United States. The literature commonly uses Poisson or binomial regression models to accommodate the characteristic of integer-valued data, but such models are no longer suitable once we deal with time series datasets. Zhu et al. [2] proposed a mixture integer-valued autoregressive conditional heteroscedastic (INARCH) model to deal with computer-aided dispatch calls data. After Ferland et al. [3] proposed the integer-valued generalized autoregressive conditional heteroscedastic (INGARCH) model for time series of counts, numerous extensions and applications of INGARCH models have emerged. Notable examples include those in Weiß [4], Fokianos et al. [5], Zhu [6], Chen and Lee [7], and Chen and Lee [8]. Specifically, Xu et al. [9] introduced a new dispersed INARCH model to examine the time series of dengue cases in Singapore. Chen and Lee [8] investigated the causal relationship between climate and criminal behavior using INGARCHX models to reflect one or more exogenous series. Moreover, Chen and Khamthong [10] proposed nonlinear INGARCHX models for weekly dengue case counts in Thailand, incorporating two climatological covariates: temperature and precipitation.
Many time series of counts often have a large number of zero values like when dealing with rare diseases or rare events, resulting in excessive series dispersion. Therefore, a large strand of the literature discusses the zero-count (zero-inflation) approach to capturing such phenomena. Lambert [11] presented a count data model, the zero-inflated Poisson (ZIP) model, whose observed values are random events with a large number of zero count data in a unit of time. The ZIP model is a mixed model consisting of a Poisson assignment and zero probability, which are commonly used in quality control and accidents (e.g., Wang [12], Yau et al. [13], and Jazi et al. [14]).
There is another type of integer-valued time series, called categorical time series. Forecasting bounded ordinal time series data is challenging due to their discrete and constrained nature. Bounded ordinal data refer to data with a natural order, but limited to a specific range, such as survey responses on a scale from 1 to K, where . One example is the air quality index (AQI), in which the AQI value is divided into six levels, and the data are mostly between levels 1 and 2. Related research on AQI by Chen and Chiu [15] and Liu et al. [16] indicates that AQI data belong to bounded time series data. The time series of AQI levels are ordinal, because the class increases as the AQI interval value increases.
Unlike the classical Poisson distribution and the zero-inflated Poisson (ZIP) distribution, the zero-one-inflated bounded (ZOB) Poisson distribution model, as proposed by Liu et al. [16], is finite across states . This makes it more suitable for fitting restricted states to match the levels of the data. Additionally, Weiß and Jahn [17] introduced soft-clipping binomial INGARCH models as time series models for bounded counts, which also accommodate negative autocorrelations. Liu et al. [18] presented a review of the developments in INGARCH models over the past five years, focusing on unbounded and bounded non-negative counts.
The ZOB Poisson distributed model is bounded and primarily concentrated on the categories of state 0 and state 1, which occur with larger probabilities compared with other categories. The ZOB Poisson distribution to study normalcy-dominant ordinal time series is as follows:
where and are the inflated parameters for states 0 and 1. The constraint , , is the intensity parameter, and the integer is a given upper bound.
In the ZOB Poisson distribution, the case of is called a bounded Poisson distribution. When , the phenomenon of inflation that may occur is in state 0 or state 1, thus allowing the datapoint to fit a normalcy-dominant ordinal time series with two possible normal states, 0 and 1. In this ZOB Poisson autoregressive (ZOBPAR) model, the intensity function adopts an autoregressive structure so that varies with time.
Instead of employing the method of maximum likelihood estimator as in Liu et al. [16], the present study uses Bayesian inference with the Markov chain Monte Carlo (MCMC) method to estimate unknown parameters. The advantages of using Bayesian methods include: (1) allowing the incorporation of prior knowledge or beliefs to form a prior distribution that more accurately describes the uncertainty of parameter estimates; (2) enabling simultaneous analysis of all unknown parameters and forecasting; and (3) providing probabilities that parameters fall within credible intervals, which offer a more intuitive and direct way to understand and communicate the uncertainty of parameter estimates.
This study makes three contributions to the existing literature. (1) We incorporate exogenous variables to develop the ZOBPARX model, thus accommodating more flexible situations. (2) We employ Bayesian parameter estimation methods for quantifying uncertainty. (3) We predict one-step-ahead categories for out-of-sample forecasts. The aim is to demonstrate that the model from the ZOBPARX family can effectively capture the dynamic relationships in ordinal data and provide reasonable predictions for the out-of-sample (test) period. To our knowledge, no Bayesian approach or forecast evaluation is currently available for this proposed model.
This paper proceeds as follows. Section 2 reviews the methodologies, the MCMC sampling scheme, and forecasting method. Section 3 explains the results of simulation studies and the accuracy of estimates. Section 4 provides an empirical study of AQI-level forecasts and evaluates forecast accuracy. Finally, Section 5 offers concluding remarks.
2. ZOBPAR and ZOBPARX Models
We denote the bounded Poisson distribution by , where is the mean of Poisson, and K is an upper bound of category. The bounded Poisson distribution equals the ZOB Poisson distribution in Equation (1) with . Let be an independent and identically distributed (i.i.d.) sequence having the following probability distribution:
From the definition by Liu et al. [16], the ordinal time series data are then said to follow a ZOBPAR model if:
with and:
where is the available information up to time t, and satisfying Equation (2) is independent of . To achieve the stationary condition of Equation (3), Liu et al. [16] provide a sufficient condition as:
Note that Equation (3) is similar to the INGARCH(1,1) model of Weiß [19] for capturing the dynamic structure of .
When follows the ZOBPAR model, we can conduct the conditional probability of as:
where , and its conditional mean and variance are:
where the function , and and are its first and second derivatives, respectively.
In addition to considering the effects of exogenous variables, we extend Equation (3) by incorporating these variables, denoted as . We then define the ZOBPAR model with exogenous variables (ZOBPARX) as:
where is the parameter of the jth exogenous variable, and we restrict to ensure non-negativeness of . S is the number of exogenous variables.
3. Bayesian Inference and Forecasting
Let be the unknown parameter vector of the ZOBPAR model, where , and . Based on Equation (4), the log-likelihood function of is:
and
where is defined by Equation (3) or (5), and .
Using Bayes’ rule, we multiply the likelihood and the priors to give the conditional posterior probability as follows:
where , and . Here, denotes each parameter in , is the prior density of , and is the vector of all model parameters except for . We choose noninformative priors for all parameters and restrict the parameters to satisfy the specified region. , and , where and .
3.1. MCMC Sampling Schemes
We designed an MCMC sampling scheme to obtain the posterior estimates of for the ZOBPAR model using an adaptive Metropolis-Hastings (MH) algorithm by Chen and So [20]. This approach combines the random walk Metropolis (RW-M) algorithm and the independent kernel MH (IK-MH) algorithm, with a total number of iterations N and burn-in iterations M. The steps of the MCMC sampling scheme are as follows.
- Step 1:
- Give initial values of .
- Step 2:
- At the ith iteration, update by the MH algorithm conditional on . If , then the RW-M algorithm is used; otherwise, the IK-MH algorithm is employed.
- Step 3:
- Similarly, at the ith iteration, is updated by the MH algorithm conditional on . If , then the RW-M algorithm is used; otherwise, the IK-MH algorithm is employed.
- Step 4:
- Repeat Step 2 and Step 3 until the number of iterations equals N.
The procedures for the RW-M and IK-MH algorithms for run as follows.
- (i)
- The steps of RW-M for from are as follows.
- Step 1:
- , where , I is identity matrix, and is the estimate of at the (i− 1)th iteration.
- Step 2:
- Accept as with acceptance probability:where is the conditional posterior distribution of . If , set ; otherwise, set .
- Step 3:
- Repeat Step 1 and Step 2 for each MCMC iteration during the burn-in iterations.
Note that the stepsize c in Step 1 controls the acceptance rate for .According to Gelman et al. [21], a suitable value of the acceptance rate for good convergence is between and . - (ii)
- The steps of IK-MH for from are as follows:
- Step 1:
- , where , with and as the sample mean and sample variance of the estimates of from the burn-in samples.
- Step 2:
- Update as with acceptance probability:where g is a Gaussian proposal density with mean and variance . If , set ; otherwise, set .
- Step 3:
- Repeat Step 1 and Step 2 until the total number of iterations is N.
The RW-M and IK-MH procedures for are similar to the procedures of .
3.2. Bayesian Forecasting
In the empirical illustration, we conduct one-step-ahead forecasting to predict . The Bayesian forecasting procedure runs as follows (for example, ):
- Step 1:
- Obtain the posterior means of parameters , where and ,and then put into Equation (3) to calculate
- Step 2:
- Compute the conditional probability of:where by and .
- Step 3:
- Generate a random number .
- (1)
- If , then ; else, move to the next statement.
- (2)
- If , then ; else, move to the next statement.
- (3)
- If , then ; else, move to the next statement.
- (4)
- If , then ; else, move to the next statement.
- (5)
- .
We employ this procedure to generate one-step-ahead forecasts for bounded ordinal time series data.
4. Simulation Study
In this section we conduct a simulation study to examine the established Bayesian MCMC method. There are two models: Model 1 is a ZOBPAR model, and Model 2 is a ZOBPARX model, specified as follows to generate simulated data with sample size and :
Model 1:
Model 2:
We generate in Model 2 from a Gamma distribution with . Both models follow a bounded Poisson distribution with bound .
A computational issue arises with a slow convergence rate of parameter estimates, when we simultaneously estimate a set of unknown parameters. To implement the estimation of model parameters, we use the Bayesian method with a designed MCMC sampling. We employ two sampling mechanisms to speed up the convergence of MCMC sampling. First, we use an adaptive MCMC sampling method, which combines the RW-M algorithm and the IK-MH algorithm, as mentioned in the Section 3.1. Second, due to autocorrelation, we choose every five samplers as a thinning chain in MCMC outputs. The total number of MCMC iterations is 20,000, which includes a burn-in period of 8000 iterations. Based on R codes, the computational times for the ZOBPAR model with sample sizes and are 159 and 309 s, respectively. For the ZOBPARX model, the CPU times are approximately 241 and 464 s. Parameter estimations are efficiently completed in under eight minutes, and all computations are performed on a computer equipped with an i7-11700 CPU and 64 GB of RAM.
To examine the convergence of MCMC, we monitor the trace and autocorrelation function (ACF) plots of MCMC samplers during the after burn-in iterations. We provide trace plots and ACF plots for MCMC samples based on simulated data from Model 1 and Model 2 with sample size (see Figure 1 and Figure 2). We expect that the trace plots of all parameters randomly vary between a reasonable constant range, and that the ACF plots present no autocorrelation observed in the MCMC samplers. This demonstrates that we have converged MCMC samplers, and that the parameter estimates are reliable.
Figure 1.
Trace plots and ACF plots of parameter estimates related to the ZOBPAR model (Model 1).
Figure 2.
Trace plots and ACF plots of parameter estimates related to the ZOBPARX model (Model 2).
The results in Table 1 reveal that the averages of posterior means based on 100 replications are close to the true values, and Bayesian credible intervals ( and ) can accurately cover the corresponding true values. This confirms that the designed MCMC method provides accurate parameter estimates. To examine the consistency of parameter estimates, we offer the results of parameter estimates for different sample sizes and 1000. The results indicate that the proposed Bayesian method presents accurate parameter estimates with small standard deviations as the sample size increases.
Table 1.
Parameter estimates of the ZOBPAR model (Model 1) based on 100 replications.
The results of parameter estimates for the ZOBPARX model (Model 2) are presented in Figure 2 and Table 2. All MCMC samples converge, and the posterior means are close to the true values. Again, all parameter estimates are close to the true values, while the posterior standard deviations are small as the sample size increases.
Table 2.
Parameter estimates of the ZOBPARX model (Model 2) based on 100 replications.
5. Empirical Application
In order to demonstrate our proposed method, we investigate daily AQI levels from the weather stations of Pingtung, Miaoli, and Zuoying in Taiwan. We collect daily AQI levels for each station from 30 December 2016 to 31 January 2020 for a total of 1129 observations. To evaluate the forecasting performance, we separate the whole sample period into two periods: the training period with 764 observations for in-sample model estimation and the testing period with 365 days for out-of-sample forecasts. By a rolling window approach, we conduct one-step ahead forecasts for 365 days and evaluate the forecast performance by computing the accuracy of AQI level forecasts.
Precipitation can effectively reduce particulate matter concentrations (PM10 and PM2.5) in the air. When it rains, these particles are captured by raindrops and carried to the ground. This process can lead to a significant improvement in AQI, particularly in reducing particulate pollution. We thus treat daily accumulated precipitation (PRE) and the winter dummy variable as exogenous variables and consider different combinations of ZOBPARX models to study the effects of exogenous variables. We define daily accumulated PRE as an exogenous variable , which negatively correlates with the AQI level. We propose two distinct transformations to fulfil the coefficient constraint () and to address the negative correlation between yesterday’s PRE and today’s AQI level. The first transformation involves taking the reciprocal of PRE (TF1_PRE), set as , and the second involves computing the exponential of the negative PRE (TF2_PRE), designated as .
These transformations of can ensure having positive estimates of the coefficients in the ZOBPARX model. To investigate the effect of seasonality, we consider the exogenous variables of the winter dummy variable and month dummy variables. For the winter dummy variable, we define as:
According to the definitions of exogenous variables, we consider following ZOBPARX models with different combinations of exogenous variables for three datasets.
- ZOBPARX 1: (exogenous variable: TF1_PRE)
- ZOBPARX 2: (exogenous variable: TF2_PRE)
- ZOBPARX 3: (exogenous variables: TF1_PRE and winter dummy)
- ZOBPARX 4: (exogenous variables: TF2_PRE and winter dummy)
In the process of data collection, we faced problems of missing observations in AQI and weather covariate. To avoid loss of information, we adopt the k-nearest neighbors (knn) algorithm by Cover and Hart [22] to impute the missing values, which takes the same approach as Chen and Chiu [15].
Chiu [23] conducted an experiment on a selected period of the AQI time series without missing values from three randomly chosen sites. Chiu [23] then introduced missing values at every 10th datapoint and imputed these using the KNN-imputation method from the R package “DMwR” [24]. This process compares different h values for AQI prediction, using some variables to minimize the mean absolute error (MAE) and root mean squared error (RMSE). The results suggest that employing four days () with data on rainfall, temperature, wind direction, PM2.5, and seasonal dummy variables, which are closest to the day with missing AQI, and then taking the weighted average of the AQI values from these four days, serves as an effective imputation for the missing AQI. Following the same line, we impute missing values and refer to the results of Chiu [23] and Chen and Chiu [15] to set for the imputation of missing values in PRE and AQI, as shown below.
- (1)
- PRE with missing values: If the PRE value is missing at day t, then we pick the valid data of the nearest station to impute;
- (2)
- AQI with missing values: In the knn algorithm, we set to impute the missing AQI values by the corresponding weather data in the nearest four days concerning PRE, daily average temperature, daily wind direction, PM2.5, and seasonal dummy value closest to the day with missing AQI; the imputation of a missing AQI value is the weighted average of the AQI values of these four days. The weights decrease as the distances to their neighbor increase, and we use a Gaussian kernel function to take the weights from their distances. For more details, one can refer to Torgo [24].
Following the U.S. EPA’s classification of AQI levels, we classify AQI values into four levels, each represented by different colors for visual identification, as in Table 3. Figure 3 presents the time series plots of daily AQI values from 30 December 2016 to 31 January 2020 for Pingtung, Miaoli, and Zuoying. Observing these time series plots, the changes of AQI values in Pingtung and Zuoying are more volatile than in Miaoli, and the AQI values are low from May to September for each year at each station. A large proportion of AQI values under 100 at the three stations distinctly demonstrates the phenomenon of zero-one inflation in the data concerning AQI levels. Figure 4 plots the proportions of AQI levels by stacked bar charts for each month and each station. It is obvious that the levels of AQI are quite different among months, while the period from June to August has better air quality, with large proportions of 0 and 1 levels of AQI.
Table 3.
Classification of AQI levels.
Figure 3.
Time series plots of daily AQI values for Pingtung, Miaoli, and Zuoying from 30 December 2016 to 31 January 2020.
Figure 4.
Monthly AQI levels for Pingtung, Miaoli, and Zuoying from 30 December 2016 to 31 January 2020.
The time series plots of daily PRE for the three stations, presented as the exogenous variable PRE in Figure 5, indicate that the rainy season occurs periodically from May to September. The highest daily cumulative PRE typically occurs during July and August. We provide the summary statistics of AQI and PRE for the three stations in Table 4. The means and standard deviations of AQI of Pingtung and Zuoying are both larger than their values for Miaoli. This is consistent with the findings in Figure 3. The maximum values of AQI of Pingtung and Zuoying both exceed 200, which is at the level of “poor”. Similar to the PRE values, Pingtung and Zuoying have larger mean values and standard deviations of PRE than Miaoli. We observe that the weather is relatively unstable in Pingtung and Zuoying.
Figure 5.
Time series plots of daily PRE values for Pingtung, Miaoli, and Zuoying from 30 December 2016 to 31 January 2020.
Table 4.
Summary statistics of daily AQI and PRE values.
We employ MCMC methods for parameter estimation and one-step-ahead AQI forecasts during the out-of-sample period for each dataset. To evaluate the performance of our proposed method in forecasting AQI levels, we fit a ZOBPAR model and four ZOBPARX models, using each to perform one-step-ahead forecasts for 365 days through the rolling window method. To the best of our knowledge, apart from the ZOBPAR model, no other model can investigate zero-one inflation with a bounded Poisson distribution. Therefore, we consider a ZOBPAR model and four ZOBPARX models for the comparative analysis. Table 5 presents two evaluation metrics—accuracy and penalty—used to assess the AQI level forecasts over a period of 365 days for each dataset, as predicted by these five different models.
Table 5.
Accuracy percentages and penalty scores of AQI level forecasts for Pingtung, Miaoli, and Zuoying from the considered models.
Categories 0 and 1 of AQI levels both represent satisfactory air quality and have no effect on human health, whereas other categories are harmful to human health. We treat all categories split into two levels for calculating the accuracy of AQI-level forecasts. To evaluate the effectiveness of model prediction, we propose a penalty mechanism that reflects a numerical ‘cost’ or ‘penalty’ for each type of misclassification in our categorical data. The scores of penalty in Table 5 are the sum of daily penalties during the forecasting period. The lower the penalty score, the better the model.
We assign a numerical ‘cost’ or ‘penalty’ to each type of misclassification in our categorical data, adopting a concept akin to weighted absolute error. This approach involves defining a weight or cost for each type of error (e.g., predicting Category A when the actual category is B) and then calculating an overall score based on these weights. If there are differences between actual AQI levels and predicted AQI levels, denoted as , are −1, −2, 1, or 2, then we design a penalty mechanism with corresponding weights of 1, 2, 4, and 8, respectively. The aim of this method is to ensure that the weights accurately reflect the relative importance of each category, with the underestimated category receiving a higher penalty than the overestimated one, and to accurately represent the cost of each type of error in our specific context.
For the Pingtung site, the ZOBPARX 1 model shows the highest accuracy (67.4%), indicating it most frequently predicts the AQI levels correctly. ZOBPARX 1 also leads with a lower penalty score than the other models. For the Miaoli site, the ZOBPARX 3 model has the highest accuracy (87.9%) and the lowest penalty. For the Zuoying site, the ZOBPARX 3 model shows the highest accuracy (69.9%) but the ZOBPAR model provides the lowest penalty. The ZOBPARX 2 model performs as the second-best model, provideing reasonable accuracy and a low penalty.
It appears that the models generally perform well in forecasting for Miaoli’s air quality. In contrast, these models exhibit lower accuracy in Pingtung and Zuoying. In summary, different models excel in different locations, highlighting the importance of selecting location-specific models for accurate AQI forecasting.
The Environmental Protection Bureau of the Pingtung County government states that the county’s geographic location at the southernmost end, often positioned downwind of prevailing winds, contributes to its air pollution issues. The weaker wind speeds in winter, fewer days of rainfall, and poor atmospheric dispersion conditions, compounded by the region’s topography that can create localized eddy currents, are all factors that lead to higher concentrations of air pollutants in Pingtung County during the autumn and winter seasons, often exceeding standard levels [25]. We need to tailor models for local geographical and environmental characteristics to improve forecast accuracy for specific regions like Pingtung.
For the three datasets, we present the posterior estimates of parameters for the best performing models in Table 6 and also provide convergence diagnostic checking, the convergence diagnostic (CD) test, and inefficiency factors (Ineff.) to demonstrate converged parameter estimates. The CD test introduced by Geweke [26] has a p-value greater than 0.05, and the Ineff. of Chib [27] has a small value that is far less than MCMC iterations; both reveal that the chain of MCMC samples converges. The last two columns of Table 6 present the p-values of CD tests and the Ineff. values. All p-values of CD tests are greater than 0.05, and all Ineff. values are far smaller than MCMC iterations. This means that all parameter estimates converge and are reliable for making inferences.
Table 6.
Parameter estimation of the three sites based on the best forecasting models.
To check the adequacy of the fitted model, we compute standardized Pearson’s residuals by Jung et al. [28] as follows:
For Zuoying, in Table 6 we observe that the posterior estimate of in the ZOBPARX 2 model is larger than and . This indicates that the AQI level of the previous day has a significant effect on the current day’s mean AQI level. The estimates of the probabilities and show that is larger than in the case of Zuoying. Focusing on the parameters of exogenous variables, the parameter for in the ZOBPARX 2 model for Pingtung has an estimate of 0.0457.
For Pingtung, the ZOBPARX 1 model produces more accurate predictions among the competing models. Bayesian parameter estimations for the ZOBPARX 1 model are presented in Table 6. We also observe that the posterior estimate of has a larger magnitude than that of and , and again the estimate of is greater than at this site.
For Miaoli, the ZOBPARX 3 model is the best-performing model with the highest accuracy among the competing models. The parameter estimation results for the ZOBPARX 3 model are presented in Table 6. The posterior estimate of is again larger than both and , and the estimates of and reveal that is larger than in the Miaoli dataset. When examining the parameters of exogenous variables, the parameters and for and in the ZOBPARX 3 model for Miaoli demonstrate a much smaller magnitude on AQI levels, with estimates of 0.0200 and 0.0478, respectively.
These findings underscore the significance of in AQI forecasting models for each site, emphasizing the crucial influence of the previous day’s AQI levels. Consistently across the three sites, the estimate remains . Furthermore, PRE or the winter dummy variable plays an important role in the forecasts.
Figure 6 displays the time plots and ACF plots of the standardized Pearson’s residuals for the three sites, derived using the most accurate forecasting models. The diagnostic checking results suggest that the proposed models adequately capture the changes in AQI levels at these sites.
Figure 6.
Time plots and ACF plots of standardized Pearson’s residuals for: (a) Zuoying, (b) Pingtung, and (c) Miaoli.
To gain a detailed understanding of the performances of AQI level forecasts in each dataset, we compute the proportions of forecasted AQI levels and compare them with the actual proportions of AQI levels observed during the 365-day out-of-sample forecasting period, as shown in Table 7. Focusing on the results of Miaoli, the best model obtained in Table 5 provides accurate results on AQI level forecasts, with the proportions of forecasted levels close to the true proportions. In Pingtung, due to local geographical and environmental conditions, even the best forecasting model does not perform well in terms of the proportions of forecasted AQI levels.
Table 7.
Number of days and proportions of forecasted AQI levels and true AQI levels for Pingtung, Miaoli, and Zuoying.
6. Conclusions
This paper presents the issues of modeling and forecasting bounded ordinal time series data with a special focus on AQI levels. The data are bounded and predominantly concentrated on a few categories, such as states 0 and 1, which occur with high probabilities. We demonstrate ZOBPAR models, both with and without exogenous variables, in order to capture the dynamic changes in such ordinal time-series data. We propose a Bayesian inference method that utilizes an effective MCMC sampling mechanism. This method estimates model parameters and forecasts one-step-ahead AQI levels using a rolling window approach. To check the convergency of MCMC samplers, we monitor the trace and ACF plots by visual inspection and compute Geweke’s convergence diagnostic.
Simulation studies demonstrate that the proposed method provides reliable model parameter estimates, and the posterior means of these estimates are robustly close to the actual values as the sample size increases. For the empirical study, we investigate three datasets of daily AQI levels from Pingtung, Zuoying, and Miaoli stations in Taiwan. Apart from Pingtung County, the prediction outcomes demonstrate that the proposed method effectively forecasts AQI levels during the testing period. This is evident from the alignment of predicted AQI proportions with the actual proportions, especially notable at the Miaoli station. To enhance forecasting accuracy for Pingtung, it is essential to customize models that fit its unique geographical and environmental characteristics.
Aside from using parametric models, there are some machine learning methods for forecasting bounded ordinal time series data, including tree-based models, neural networks, and support vector machines. We propose the consideration of model-averaging forecasts for ordinal and bounded time series data. The advantage of averaging over multiple models is the reduced risk of relying on a single, potentially inappropriate model, which can also enhance accuracy by balancing out biases inherent in individual models. We plan to explore this aspect in our future work.
Author Contributions
Conceptualization, C.W.S.C.; Methodology, F.-C.L. and C.W.S.C.; Software, C.-Y.H.; Formal analysis, F.-C.L., C.W.S.C. and C.-Y.H.; Investigation, F.-C.L. and C.W.S.C.; Data curation, C.-Y.H.; Writing—original draft, F.-C.L., C.W.S.C. and C.-Y.H.; Writing—review & editing, F.-C.L. and C.W.S.C.; Visualization, C.W.S.C. and C.-Y.H.; Supervision, F.-C.L. and C.W.S.C.; Project administration, F.-C.L.; Funding acquisition, F.-C.L. and C.W.S.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by National Science and Technology Council grant number NSTC 112-2118-M-035-002- and NSTC 112-2118-M-035-001-MY3.
Data Availability Statement
The data presented in this study are available upon request from the corresponding author.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Zeger, S.L. A regression model for time series of counts. Biometrika 1988, 75, 621–629. [Google Scholar] [CrossRef]
- Zhu, F.; Li, Q.; Wang, D. A mixture integer-valued ARCH model. J. Stat. Plan. Inference 2010, 140, 2025–2036. [Google Scholar] [CrossRef]
- Ferland, R.; Latour, A.; Oraichi, D. Integer-valued GARCH process. J. Time Ser. Anal. 2006, 27, 923–942. [Google Scholar] [CrossRef]
- Weiß, C.H. Modelling time series of counts with overdispersion. Stat. Methods Appl. 2009, 18, 507–519. [Google Scholar] [CrossRef]
- Fokianos, K.; Rahbek, A.; Tjøstheim, D. Poisson autoregression. J. Am. Stat. Assoc. 2009, 104, 1430–1439. [Google Scholar] [CrossRef]
- Zhu, F. Modeling overdispersed or underdispersed count data with generalized Poisson integer-valued GARCH models. J. Math. Anal. Appl. 2012, 389, 58–71. [Google Scholar] [CrossRef]
- Chen, C.W.S.; Lee, S. Generalized Poisson autoregressive models for time series of counts. Comput. Stat. Data Anal. 2016, 99, 51–67. [Google Scholar] [CrossRef]
- Chen, C.W.S.; Lee, S. Bayesian causality test for integer-valued time series models with applications to climate and crime data. J. R. Stat. Soc., C: Appl. Stat. 2017, 66, 797–814. [Google Scholar] [CrossRef]
- Xu, H.Y.; Xie, M.; Goh, T.N.; Fu, X. A model for integer-valued time series with conditional overdispersion. Comput. Stat. Data Anal. 2012, 56, 4229–4242. [Google Scholar] [CrossRef]
- Chen, C.W.S.; Khamthong, K. Bayesian modelling of nonlinear negative binomial integer-valued GARCHX models. Stat. Model. 2020, 20, 537–561. [Google Scholar] [CrossRef]
- Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14. [Google Scholar] [CrossRef]
- Wang, P. Markov zero-inflated Poisson regression models for a time series of counts with excess zeros. J. Appl. Stat. 2001, 28, 623–632. [Google Scholar] [CrossRef]
- Yau, K.K.W.; Lee, A.H.; Carrivick, P.J.W. Modeling zeroinflated count series with application to occupational health. Comput. Methods Programs Biomed. 2004, 74, 47–52. [Google Scholar] [CrossRef] [PubMed]
- Jazi, M.A.; Jones, G.; Lai, C.D. First-order integer valued AR processes with zero inflated Poisson innovations. J. Time Ser. Anal. 2012, 33, 954–963. [Google Scholar] [CrossRef]
- Chen, C.W.S.; Chiu, L.M. Ordinal time series forecasting of the air quality index. Entropy 2021, 23, 1167. [Google Scholar] [CrossRef] [PubMed]
- Liu, M.; Zhu, F.; Zhu, K. Modeling normalcy-dominant ordinal time series: An application to air quality level. J. Time Ser. Anal. 2022, 43, 460–478. [Google Scholar] [CrossRef]
- Weiß, C.H.; Jahn, M. Soft-clipping INGARCH models for time series of bounded counts. Stat. Model. 2022, 0. [Google Scholar] [CrossRef]
- Liu, M.; Zhu, F.; Li, J.; Sun, C.A. Systematic Review of INGARCH Models for Integer-Valued Time Series. Entropy 2023, 25, 922. [Google Scholar] [CrossRef]
- Weiß, C.H. An Introduction to Discrete-Valued Time Series; Wiley: Hoboken, NJ, USA, 2018. [Google Scholar] [CrossRef]
- Chen, C.W.S.; So, M.K.P. On a threshold heteroscedastic model. Int. J. Forecast. 2006, 22, 73–89. [Google Scholar] [CrossRef]
- Gelman, A.; Roberts, G.O.; Gilks, W.R. Efficient metropolis jumping rules. Bayesian Stat. 1996, 5, 599–607. [Google Scholar]
- Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
- Chiu, L.M. Air Quality Forecasting in Taiwan Based on Support Vector Machine and Statistical Models. Master’s Thesis, Feng Chia University, Taichung, Taiwan, 2020. Available online: https://hdl.handle.net/11296/ttra2z (accessed on 9 December 2023).
- Torgo, L. Data Mining with R, Learning with Case Studies Chapman and Hall/CRC. 2010. Available online: http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR (accessed on 20 September 2022).
- Pingtung County Government. “Causes and Sources of Pollution in Seasons with Poor Air Quality” Pingtung County Government, (April 22, 2019) in Chinese. 2019. Available online: https://www.pthg.gov.tw/plantou/News_Content.aspx?n=B666B8BE5F183769&sms=6B402F30807E7BB3&s=34B0170139EA89FD (accessed on 9 December 2023).
- Geweke, J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments (with discussion). In Bayesian Statistics 4; Berger, J.O., Bernardo, J.M., Dawid, A.P., Smith, A.F.M., Eds.; Oxford University Press: Oxford, UK, 1992; pp. 169–193. [Google Scholar]
- Chib, S. Markov Chain Monte Carlo methods: Computation and inference. Handb. Econom. 2001, 5, 3569–3649. [Google Scholar]
- Jung, R.C.; Kukuk, M.; Liesenfeld, R. Time series of count data: Modeling, estimation and diagnostics. Comput. Stat. Data Anal. 2006, 51, 2350–2364. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).