
Time Series of Counts under Censoring: A Bayesian Approach

1 Faculdade de Engenharia, Universidade do Porto, CIDMA, 4200-465 Porto, Portugal
2 Faculdade de Economia, Universidade do Porto, LIADD-INESC TEC, 4200-464 Porto, Portugal
3 Departamento de Matemática, Universidade de Aveiro, CIDMA, 3810-193 Aveiro, Portugal
4 School of Management, University of Liverpool, Liverpool L69 3BX, UK
* Author to whom correspondence should be addressed.
Entropy 2023, 25(4), 549; https://doi.org/10.3390/e25040549
Submission received: 8 January 2023 / Revised: 20 March 2023 / Accepted: 21 March 2023 / Published: 23 March 2023
(This article belongs to the Special Issue Discrete-Valued Time Series)

Abstract

Censored data are frequently found in diverse fields, including environmental monitoring, medicine, economics and the social sciences. Censoring occurs when observations are available only over a restricted range, e.g., due to a detection limit. Ignoring censoring produces biased estimates and unreliable statistical inference. The aim of this work is to contribute to the modelling of time series of counts under censoring using convolution closed infinitely divisible (CCID) models. The emphasis is on estimation and inference problems, using Bayesian approaches with Approximate Bayesian Computation (ABC) and Gibbs sampler with Data Augmentation (GDA) algorithms.

1. Introduction

Observations collected over time or space are usually correlated rather than independent. Time series are often observed with data irregularities such as missing values or detection limits. For instance, a monitoring device may have a technical detection limit and record the limit value whenever the true value falls outside it. Such data are called type I censored data and are common in environmental monitoring, the physical sciences, business and economics. In the context of time series of counts, censored data arise, for example, in call centers, where the demand measured by the number of calls is limited by the number of operators. When the number of calls exceeds the number of operators the data are right censored and the call center incurs under-staffing and poor service to its customers.
The main consequence of neglecting censoring in time series analysis is a loss of information, reflected in biased and inconsistent estimators and distorted serial correlation. These consequences amount to inference problems that lead to model misspecification, biased parameter estimation and poor forecasts.
These problems have been solved in regression settings (i.i.d. data) and partially solved for Gaussian time series (see, for instance, [1,2,3,4,5,6,7]). However, the problem of modelling time series of counts under censoring has, as yet, received little attention in the literature, despite its relevance for inference. Count time series occur in many areas, such as telecommunications, actuarial science, epidemiology, hydrology and environmental studies, where the modelling of censored data may be invaluable in risk assessment.
In the context of time series of counts, Ref. [8] deals with correlated under-reported data through INAR(1)-hidden Markov chain models. A naïve method of parameter estimation is proposed, jointly with the maximum likelihood method based on a revised version of the forward algorithm. Additionally, Ref. [9] proposes a random-censoring Poisson model for under-reported data, which accounts for the uncertainty about both the count and the data reporting processes.
Here, the problem of modelling count data under censoring is considered under a Bayesian perspective. In this paper, we consider a general class of convolution closed infinitely divisible (CCID) models as proposed by [10].
We investigate two natural approaches to analyse censored convolution closed infinitely divisible models of first order, CCID(1), using the Bayesian framework: the Approximate Bayesian Computation (ABC) methodology and the Gibbs sampler with Data Augmentation (GDA).
Since the CCID(1) model under censoring presents an intractable likelihood, we resort to the Approximate Bayesian Computation methodology to estimate the model parameters. The presupposed model is simulated using parameter values sampled from the prior distribution; a distance between the simulated dataset and the observations is then computed and, when the simulated dataset is sufficiently close to the observed one, the corresponding parameter samples are accepted as part of the posterior.
In addition, a widely used strategy to deal with censored data is to fill in the censored observations in order to create a data-augmented (complete) dataset. When the data-augmented posterior and the conditional pdf of the latent process are both available in tractable form, the Gibbs sampler allows us to sample from the posterior distribution of the parameters given the complete dataset. This methodology is called Gibbs sampler with Data Augmentation (GDA). Here, a modified GDA, in which the data augmentation is achieved by multiple sampling of the latent variables from truncated conditional distributions (GDA-MMS), is adopted.
The Poisson integer-valued autoregressive model of first order, PoINAR(1), is one of the most popular classes of CCID models. It was proposed by [11,12], has been extensively studied in the literature and has been applied to many real-world problems because of its ease of interpretation. To motivate the proposed approaches, Figure 1 presents a synthetic dataset with n = 350 observations generated from a PoINAR(1) process with parameters α = 0.5 and λ = 5 (X_t, blue line) and the corresponding right-censored dataset (Y_t, red line) at L = 11, corresponding to 30% censoring. If the censoring is disregarded (assuming a PoINAR(1) model without censoring), the parameter estimates present a strong bias: in the frequentist framework, the conditional maximum likelihood estimates are α̂_CML = 0.6174 and λ̂_CML = 3.4078, while in the Bayesian framework, the Gibbs sampler gives α̂_Bayes = 0.6242 and λ̂_Bayes = 3.3297. On the other hand, if a PoINAR(1) model under censoring is assumed, the parameter estimates given by the approaches described in this work are, respectively, α̂_ABC = 0.4623 and λ̂_ABC = 5.2259, and α̂_GDA = 0.4834 and λ̂_GDA = 4.9073. It is therefore important to account for censoring in order to avoid inference issues that compromise the time series analysis.
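This motivating experiment can be sketched in a few lines. The following Python snippet, with our own (hypothetical) function name and an arbitrary seed, simulates a PoINAR(1) path by binomial thinning and right-censors it at L = 11; it reproduces the setting qualitatively, not the exact figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def simulate_poinar1(n, alpha, lam, rng):
    """Simulate X_t = alpha o X_{t-1} + e_t with e_t ~ Po(lam);
    the stationary margin is Po(lam / (1 - alpha))."""
    x = np.empty(n, dtype=np.int64)
    x[0] = rng.poisson(lam / (1 - alpha))          # start at the stationary margin
    for t in range(1, n):
        survivors = rng.binomial(x[t - 1], alpha)  # binomial thinning of X_{t-1}
        x[t] = survivors + rng.poisson(lam)        # add Poisson innovations
    return x

x = simulate_poinar1(350, alpha=0.5, lam=5.0, rng=rng)
y = np.minimum(x, 11)                              # right censoring at L = 11
print(float(np.mean(y == 11)))                     # observed censoring rate
```

Fitting an uncensored PoINAR(1) to y instead of x exhibits the bias illustrated in Figure 1.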
The remainder of this work is organized as follows. Section 2 presents a general class of convolution closed infinitely divisible (CCID) models under censoring. Two Bayesian approaches proposed to estimate the parameters of the censored CCID(1) model are described in Section 3. The proposed methodologies are illustrated and compared with synthetic data in Section 4. Finally, Section 5 concludes the paper.

2. A Model for Time Series of Counts under Censoring

This section introduces a class of models adequate for censored time series of counts based on the convolution closed infinitely divisible (CCID) models as proposed by [10].

2.1. Convolution Closed Models for Count Time Series

First we introduce some notation. Consider a random variable X with distribution F_μ, μ > 0, belonging to the convolution closed infinitely divisible (CCID) parametric family [10]. This means, in particular, that the distribution F_μ is closed under convolution, F_{μ_1} * F_{μ_2} = F_{μ_1+μ_2}, where * denotes the convolution operator. Let R(·) denote a random operator on X such that R(X) ~ F_{αμ}, 0 < α < 1, and the conditional distribution of R(X) given X = x is G_{αμ,(1−α)μ,x}, i.e., R(X) | X = x ~ G_{αμ,(1−α)μ,x}. As an example, consider a Poisson random variable X ~ Po(μ) and the binomial thinning operation R(X) = α ∘ X = Σ_{i=1}^{X} ξ_i, with ξ_i iid Ber(α). Then F_μ is the Poisson distribution with parameter μ, R(X) ~ Po(αμ) and R(X) | X = x ~ Bi(x, α), i.e., G_{αμ,(1−α)μ,x} is the binomial distribution with parameters x and α.
A stationary time series {X_t; t = 0, ±1, ±2, …} with margin F_μ, X_t ~ F_μ, is called a convolution closed infinitely divisible process of order 1, CCID(1), if it satisfies the equation

X_t = R_t(X_{t−1}) + e_t,

where the innovations e_t are independently and identically distributed (i.i.d.) with distribution F_{(1−α)μ} and {R_t(·): t = 0, ±1, ±2, …} are independent replications of the random operator R(·) [10]. Note that the above construction leads to a time series whose marginal distribution belongs to the same family as that of the innovations.
Model (1) encompasses many AR(1)-type models proposed in the literature for integer-valued time series. In particular, the Poisson INAR(1), PoINAR(1), the negative binomial INAR(1), NBINAR(1), and the generalised Poisson INAR(1), GPINAR(1) [13], summarized in Table 1 (marginal distribution, random operation and its pmf g(·|·), and set of parameters θ), have been widely used to model time series of counts, see inter alia [14,15].
If one chooses F_{(1−α)μ} as Poisson((1−α)μ) and the random operation as the usual binomial thinning operation (based on underlying Bernoulli random variables), R_t(X_{t−1}) = α ∘ X_{t−1} = Σ_{i=1}^{X_{t−1}} ξ_{t,i}, ξ_{t,i} iid Ber(α), then F_μ is Poisson(μ) and the Poisson integer-valued autoregressive model, PoINAR(1), as proposed by [11,12], is recovered with the familiar representation

X_t = α ∘ X_{t−1} + e_t.
Since model (1) is Markovian [10], given a time series x = (x_1, …, x_n), the conditional likelihood is

L(θ) = ∏_{t=2}^{n} f_{X_t|X_{t−1}}(x_t | x_{t−1}),

with

f_{X_t|X_{t−1}}(k | l) = P(X_t = k | X_{t−1} = l) = Σ_{j=0}^{min{k,l}} g(j | l) P(e_t = k − j).
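For the Poisson case, the transition probability above can be evaluated directly. The following is a minimal sketch with our own helper name, assuming binomial thinning (so g(j | l) is a Binomial(l, α) pmf) and Poisson innovations:

```python
from math import comb, exp, factorial

def poinar1_transition(k, l, alpha, lam):
    """P(X_t = k | X_{t-1} = l): convolution of the Binomial(l, alpha)
    thinning pmf g(j | l) with the Po(lam) innovation pmf."""
    total = 0.0
    for j in range(min(k, l) + 1):
        g = comb(l, j) * alpha**j * (1 - alpha)**(l - j)      # g(j | l)
        innov = exp(-lam) * lam**(k - j) / factorial(k - j)   # P(e_t = k - j)
        total += g * innov
    return total

# Each row of the transition kernel sums to one over k = 0, 1, 2, ...
print(round(sum(poinar1_transition(k, 4, 0.5, 5.0) for k in range(80)), 6))  # → 1.0
```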

2.2. Modelling Censoring in CCID(1) Time Series

Given a model as in (1), a basic question is whether it properly describes all the observations of a given time series, or whether some observations have been affected by censoring. Here, we describe a model for dealing with censored observations in CCID(1) processes and study some of its properties.
Exogenous censoring can be modelled by assuming (1) as a latent process and Y_t = min{L, X_t} as the observed process, where L is a constant assumed to be known. For simplicity of exposition we assume exogenous right censoring, but all the results are easily extended to left censoring or interval censoring. Hence, for right exogenous censoring,

Y_t = min{X_t, L} = { X_t, if X_t < L;  L, if X_t ≥ L },    X_t = R_t(X_{t−1}) + e_t.
Although X_t, a CCID(1) process, is Markovian, exogenous censoring implies that Y_t is not Markovian, because Y_t depends on both X_t and L. Furthermore, Y_t is not CLAR (Conditionally Linear AutoRegressive). In fact,

E(Y_t | Y_{t−1} = y_{t−1}) = E(Y_t | Y_{t−1} = y_{t−1}) I{y_{t−1} < L} + E(Y_t | Y_{t−1} = L) I{y_{t−1} = L}
 = [E(X_t | X_{t−1} = y_{t−1}) − Σ_{j=0}^{+∞} j P(X_t = L + j | X_{t−1} = y_{t−1})] I{y_{t−1} < L}
 + [E(X_t | X_{t−1} ≥ L) − Σ_{j=0}^{+∞} j P(X_t = L + j | X_{t−1} ≥ L)] I{y_{t−1} = L}.
Zeger and Brookmeyer [1] established a procedure to obtain the likelihood of a time series observed under censoring, y = (Y_1, …, Y_n), which becomes infeasible when the proportion of censoring is large. To overcome this issue, this work adopts a Bayesian approach.

3. Bayesian Modelling

The Bayesian approach to inference on an unknown parameter vector θ is based on the posterior distribution π(θ | y), defined as

π(θ | y) ∝ L(y | θ) π(θ),

where L(y | θ) is the likelihood function of the observed data y and π(θ) is the prior distribution of the model parameters.
When the likelihood is computationally prohibitive or even impossible to handle, but it is feasible to simulate samples from the model, bypassing the likelihood evaluation, as is the case for censored CCID(1) processes, Approximate Bayesian Computation (ABC) algorithms are an alternative. This methodology accepts the parameter draws that produce a match between the observed and the simulated samples, depending on a set of summary statistics, a chosen distance and a selected tolerance. The accepted parameters are then used to estimate (an approximation of) the posterior distribution, conditioned on the summary statistics that afforded the match.
On the other hand, the idea of imputation arises naturally in the context of censored data. The Gibbs sampler with Data Augmentation (GDA) allows us to obtain an augmented dataset from the censored data by using a modified version of the Gibbs sampler, which samples not only the parameters of the model from its complete conditional but also the censored observations. The usual inference procedures may then be applied to the augmented data set.

3.1. Approximate Bayesian Computation

Approximate Bayesian Computation (ABC) is based on an acceptance-rejection algorithm. ABC draws from an approximation of the posterior distribution, based on simulated data obtained from the assumed model, in situations where the likelihood function is intractable or numerically difficult to handle. Summary statistics from the synthetic data are compared with the corresponding statistics from the observed data, and a parameter draw is retained when there is a match (in some sense) between the simulated sample and the observed time series.
Recently, Ref. [16] provided the asymptotic results pertaining to the ABC posterior, such as Bayesian consistency and asymptotic distribution of the posterior mean.
Let y^0 = (Y_1^0, …, Y_n^0) be the fixed (observed) data and η(·) the model from which the data are generated. The most basic approximate acceptance/rejection algorithm, based on the works of [17,18], is as follows:
  • draw a value θ from the prior distribution π(θ);
  • simulate a sample y = (Y_1, …, Y_n) from the model η(· | θ);
  • accept θ if d(S(y), S^0) ≤ δ for some distance measure d(·,·) and some non-negative tolerance value δ, where S(·) is a summary statistic and S^0 = S(y^0) is a fixed value.
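These three steps can be illustrated on a deliberately simple toy problem: recovering a Poisson mean with the sample mean as summary statistic. The example is our own stand-in, not the CCID(1) summaries used later; the prior, tolerance and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Toy ABC rejection sampler for the mean of i.i.d. Poisson data.
y0 = rng.poisson(4.0, size=200)        # "observed" data with true mean 4
s0 = y0.mean()                         # S^0 = S(y^0), the summary statistic

accepted = []
for _ in range(20000):
    theta = rng.uniform(0.0, 10.0)     # step 1: draw theta from the prior
    y = rng.poisson(theta, size=200)   # step 2: simulate y from the model
    if abs(y.mean() - s0) <= 0.1:      # step 3: accept if d(S(y), S^0) <= delta
        accepted.append(theta)

print(round(float(np.mean(accepted)), 2))  # approximate posterior mean, close to 4
```

Shrinking the tolerance concentrates the accepted draws around the true posterior, at the cost of a lower acceptance rate.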
When the summary statistics are sufficient for the parameter and a proper distance measure is used on the space of sufficient statistics, the distribution of the accepted values tends to the true posterior as δ tends to zero. The latent structure of the thinning operator means that reduction to a sufficient set of statistics of dimension smaller than the sample size is not feasible and, therefore, informative summary statistics are often used [19].
In this work, given the characteristics of the data under study, we compare the observed data (y^0) and the synthetic (simulated) data (y) through two distinctive characteristics of CCID(1) time series that are affected by censoring: (i) the empirical marginal distribution and (ii) the lag 1 autocorrelation.
To measure the similarity between the empirical marginal distributions, the Kullback-Leibler distance, which measures the dissimilarity between two probability distributions, is calculated as

S_1(y) = d_KL(p̂^0, p̂) = Σ_j ln(p̂_j^0 / p̂_j) p̂_j^0,

where p̂_j^0 and p̂_j denote the empirical marginal distributions of the observed and the simulated time series, respectively, estimated by the corresponding sample proportions, p̂_j^0 = (1/n) Σ_{t=1}^n I{Y_t^0 = j} and p̂_j = (1/n) Σ_{t=1}^n I{Y_t = j}. Whenever p̂_j^0 is zero, the contribution of the jth term is interpreted as zero because lim_{p→0} p ln p = 0.
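A possible implementation of this summary is sketched below (our own helper name). Terms with zero observed proportion contribute zero, as stated above; as a practical guard we also drop terms where the simulated proportion is zero, which the formula leaves undefined.

```python
import numpy as np

def kl_summary(y_obs, y_sim):
    """S_1: Kullback-Leibler distance between the empirical marginal pmfs
    of two count series, estimated via sample proportions."""
    hi = int(max(y_obs.max(), y_sim.max()))
    p0 = np.bincount(y_obs, minlength=hi + 1) / len(y_obs)   # observed proportions
    p = np.bincount(y_sim, minlength=hi + 1) / len(y_sim)    # simulated proportions
    mask = (p0 > 0) & (p > 0)                                # drop degenerate terms
    return float(np.sum(p0[mask] * np.log(p0[mask] / p[mask])))

rng = np.random.default_rng(2)
a, b = rng.poisson(5.0, 500), rng.poisson(5.0, 500)
print(kl_summary(a, a))  # identical samples give distance 0.0
```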
On the other hand, the lag 1 sample autocorrelations, S_2(y^0) = ρ̂_{Y^0}(1) and S_2(y) = ρ̂_Y(1), are compared through their squared difference.
Additionally, we estimate the censoring rates, S_3(y^0) = (1/n) Σ_{t=1}^n I{Y_t^0 = L} and S_3(y) = (1/n) Σ_{t=1}^n I{Y_t = L}, which are also compared through their squared difference.
Thus, for each set of parameters θ^(k), a time series x^(k) is generated from the CCID(1) model and right censored at L, yielding y^(k) = (Y_1^(k), …, Y_n^(k)), and the above statistics, S_1(y^(k)), S_2(y^(k)) and S_3(y^(k)), are computed. Combining these statistics into a metric guiding the choice of the parameters θ requires scaling. Thus, we propose the following metric

d_S^(k) = S_1(y^(k))^2 / V(S_1(y)) + Σ_{i=2}^{3} [S_i(y^0) − S_i(y^(k))]^2 / V(S_i(y^0) − S_i(y)),

where S_i(y^0) and S_i(y^(k)) are the ith statistics obtained from the observed and the kth simulated data, respectively, and V(S_1(y)) and V(S_i(y^0) − S_i(y)) are the corresponding sample variances across the replications.
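The scaled metric can be sketched as follows, with stand-in summary values in place of the CCID(1) statistics (the function name and the toy data are ours; the variances are estimated across the N replications, as in the text):

```python
import numpy as np

def scaled_abc_distance(S_obs, S_sims):
    """Scaled distance d_S: the KL summary S_1 (already a discrepancy between
    observed and simulated series) enters squared, scaled by its sample
    variance across replications; S_2 and S_3 enter through squared
    differences, each scaled by the variance of those differences."""
    S_sims = np.asarray(S_sims, dtype=float)
    d = S_sims[:, 0] ** 2 / S_sims[:, 0].var()
    for i in (1, 2):
        diff = S_obs[i] - S_sims[:, i]
        d += diff ** 2 / diff.var()
    return d

rng = np.random.default_rng(5)
S_sims = rng.normal(size=(1000, 3)) ** 2        # stand-in summaries, N = 1000
S_obs = np.array([0.0, 0.3, 0.25])              # stand-in observed summaries
d = scaled_abc_distance(S_obs, S_sims)
keep = np.argsort(d)[: max(1, len(d) // 1000)]  # retain the 0.1% smallest distances
print(len(keep))  # → 1
```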
In summary, we propose Algorithm 1 for the ABC approach, based on [20]:
Algorithm 1 ABC for censored CCID(1)
For k = 1, …, N
     Sample θ^(k) from the prior distribution π(θ)
     Generate a time series x^(k) with n observations from the CCID(1) model
     Right censor x^(k) at L to obtain the simulated data y^(k)
     Compute S_1(y^(k)), S_2(y^(k)) and S_3(y^(k))
Compute d_S^(k) = S_1(y^(k))^2 / V(S_1(y)) + Σ_{i=2}^{3} [S_i(y^0) − S_i(y^(k))]^2 / V(S_i(y^0) − S_i(y)), k = 1, …, N
Select the values θ^(k) corresponding to the 0.1% quantile of d_S^(k), k = 1, …, N
Implementation issues regarding the prior distributions and the number of draws N for the CensPoINAR(1) model are addressed in Section 3.3 and Section 4.1.

3.2. Gibbs Sampler with Data Augmentation

Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm that generates samples from the posterior distribution by sampling from the full conditional distributions [21]. When the data are censored or contain missing values, both leading to an incomplete dataset, Ref. [22] proposed combining the Gibbs sampler with data augmentation. This methodology imputes the censored (or missing) data, thus obtaining a complete dataset, and then handles the posterior of the complete data through the iterative Gibbs sampler. The Gibbs sampler is therefore modified to sample not only the model parameters from their full conditionals but also the censored observations, yielding an augmented (complete) dataset z = (z_1, …, z_n), where

z_t = Y_t, if Y_t < L;  z_t ~ F_μ(x | x ≥ L), if Y_t = L,

where F_μ(x | x ≥ L) is the marginal distribution of the CCID(1) model truncated to the support [L, +∞). Furthermore, we consider a modified sampling procedure for the imputation, designated Mean of Multiple Simulation (MMS), proposed by [23], consisting of sampling from F_μ(x | x ≥ L) multiple times, say m, and then imputing with the (nearest integer value) median of the m samples. This procedure is designated GDA-MMS.
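The truncated sampling and MMS imputation step might look as follows in the Poisson case (a sketch with our own helper name; rejection sampling is used for the truncated draws, which is adequate when the censoring point is not far in the right tail):

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

def sample_truncated_poisson(mean, L, size, rng):
    """Rejection sampling from Po(mean) conditioned on values >= L."""
    out = []
    while len(out) < size:
        draws = rng.poisson(mean, size=4 * size)  # batch of candidate draws
        out.extend(draws[draws >= L].tolist())    # keep those in [L, +inf)
    return np.array(out[:size])

# MMS imputation of one censored observation: median of m = 10 draws from
# the truncated marginal Po(lam / (1 - alpha)), rounded up to an integer.
alpha, lam, L = 0.5, 5.0, 11
w = sample_truncated_poisson(lam / (1 - alpha), L, size=10, rng=rng)
z_t = int(np.ceil(np.median(w)))
print(z_t >= L)  # → True: the imputed value respects the censoring constraint
```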
The augmented dataset can be treated as a CCID(1) time series with conditional likelihood function given by Equation (3). The posterior distribution of θ is given by

p(θ | z) ∝ L(z | θ) π(θ),

where π(θ) is the prior distribution of the parameters. In CCID(1) models the complexity of p(θ | z) requires resorting to Markov chain Monte Carlo (MCMC) techniques for sampling from the full conditional distributions. The procedure is summarized in Algorithm 2 and detailed for the CensPoINAR(1) case in Section 3.3 and Section 4.1.
Algorithm 2 GDA-MMS for censored CCID(1)
Initialize with y = (Y_1, …, Y_n), θ^(0) = (θ_1^(0), …, θ_p^(0)), L ∈ ℝ, and n, m, N ∈ ℕ
Set z^(0) = y
For k = 1, …, N
     Sample θ_i^(k) ~ π(θ_i | θ_{(−i)}^(k−1), z^(k−1)), i = 1, …, p (θ_{(−i)} denotes the vector θ with the ith element removed)
     For t = 1, …, n
        If Y_t = L
             For j = 1, …, m
                Sample z_t^(j) ~ F(x | θ^(k), x ≥ L)
             z_t^(k) := median(z_t^(1), …, z_t^(m))
        Else
             z_t^(k) := Y_t
Return θ = [θ^(1), …, θ^(N)] and z^(N)

3.3. The Particular Case of CensPoINAR(1)

This section details the ABC and GDA-MMS procedures to estimate a censored CCID(1) model with the binomial thinning operation and Poisson marginal distribution: the censored Poisson INAR(1), CensPoINAR(1), model.
Consider the censored observations y = (Y_1, …, Y_n) from a PoINAR(1) time series x = (X_1, …, X_n) defined as

Y_t = min{X_t, L} = { X_t, if X_t < L;  L, if X_t ≥ L },    X_t = α ∘ X_{t−1} + e_t,

with α ∘ X_{t−1} = Σ_{i=1}^{X_{t−1}} ξ_{t,i}, ξ_{t,i} iid Ber(α), e_t ~ Po(λ) and X_t ~ Po(λ/(1−α)). Then θ = (α, λ) and, given x, the conditional likelihood function is given by

L(θ) = ∏_{t=2}^{n} f_{X_t|X_{t−1}}(x_t | x_{t−1}) = ∏_{t=2}^{n} Σ_{j=0}^{min{x_t, x_{t−1}}} binom(x_{t−1}, j) α^j (1−α)^{x_{t−1}−j} e^{−λ} λ^{x_t−j} / (x_t−j)!.
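This conditional likelihood is cheap to evaluate for small counts. A direct transcription in log form is sketched below (the helper name is ours):

```python
from math import comb, exp, factorial, log

def poinar1_loglik(x, alpha, lam):
    """Conditional log-likelihood of a PoINAR(1) series: sum over t >= 2 of
    log f(x_t | x_{t-1}), with the transition pmf written as the convolution
    of binomial thinning and Poisson innovations."""
    ll = 0.0
    for prev, cur in zip(x[:-1], x[1:]):
        p = sum(
            comb(prev, j) * alpha**j * (1 - alpha)**(prev - j)
            * exp(-lam) * lam**(cur - j) / factorial(cur - j)
            for j in range(min(cur, prev) + 1)
        )
        ll += log(p)
    return ll

print(poinar1_loglik([4, 5, 3, 6, 4], alpha=0.5, lam=2.0))  # a finite negative value
```

Maximizing this function over (α, λ) yields the conditional maximum likelihood estimates mentioned in the introduction, which are biased when applied to censored data.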
Under a Bayesian approach, we need a prior distribution for θ . In the absence of prior information, we use weakly informative prior distributions for θ detailed below.

3.3.1. ABC for Censored PoINAR(1)

The ABC procedure described in Algorithm 1 is now implemented for the censored PoINAR(1) model. For the parameter 0 < α < 1, we choose a non-informative U(0, 1) prior, which allows us to explore the whole support of α. For the positive parameter λ, we choose a U(0, 10) prior, which restricts the support to a range in accordance with small counts.

3.3.2. GDA-MMS for Censored PoINAR(1)

Under the GDA-MMS approach, we first need to obtain a complete dataset z = (z_1, …, z_n) by imputing the censored observations, see (8). In this work, if Y_t = L, we draw m = 10 replicates from the Poisson distribution with parameter λ/(1−α) truncated at L (support [L, +∞)), w_i ~ Po(λ/(1−α)) I(w_i ≥ L), and set z_t = ⌈median(w)⌉, w = (w_1, …, w_m), where ⌈c⌉ denotes the integer ceiling of c. Figure 2 shows an augmented dataset (Z_t, black line) from the synthetic data presented in Figure 1.
As remarked above, given the complexity of the posterior distribution, Markov chain Monte Carlo techniques are required for sampling from the full conditional distributions. Thus, Adaptive Rejection Metropolis Sampling (ARMS) is used inside the Gibbs sampler [24]. In this approach too, in the absence of prior information, we use weakly informative prior distributions for (α, λ). For the parameter 0 < α < 1, we choose a beta prior, conjugate to the binomial distribution, with parameters (a, b), while for the positive parameter λ, we choose a Gamma(shape, rate) prior, conjugate to the Poisson distribution, with parameters (c, d). The full conditional of λ is given by
p(λ | α, z) = p(λ, α | z) / p(α | z) ∝ exp[−(d + n − 1)λ] λ^{c−1} ∏_{t=2}^{n} Σ_{i=0}^{min{z_t, z_{t−1}}} C(t, i) λ^{z_t − i}, λ > 0,

where

C(t, i) = binom(z_{t−1}, i) α^i (1 − α)^{z_{t−1} − i} / (z_t − i)!.

The full conditional distribution of α is given by

p(α | λ, z) = p(λ, α | z) / p(λ | z) ∝ α^{a−1} (1 − α)^{b−1} ∏_{t=2}^{n} Σ_{i=0}^{min{z_t, z_{t−1}}} K(t, i) α^i (1 − α)^{z_{t−1} − i}, 0 < α < 1,

where

K(t, i) = λ^{z_t − i} binom(z_{t−1}, i) / (z_t − i)!.
The estimates of α and λ are computed as the respective posterior means.
The GDA-MMS procedure to estimate a censored PoINAR(1) process is detailed in Algorithm 3.
Algorithm 3 GDA-MMS for CensPoINAR(1)
Initialize with y, θ^(0) = (α^(0), λ^(0)), L ∈ ℝ, and N, m ∈ ℕ
Set z^(0) = y
For k = 1, …, N
     Using ARMS
        Sample λ^(k) ~ p(λ | α^(k−1), z^(k−1))
        Sample α^(k) ~ p(α | λ^(k), z^(k−1))
     For t = 1, …, n
        If Y_t = L
             For j = 1, …, m
                Sample w^(j) ~ Po(λ^(k)/(1 − α^(k))) I(w^(j) ≥ L)
             z_t^(k) := ⌈median(w)⌉, w = (w^(1), …, w^(m))
        Else
             z_t^(k) := Y_t
Return θ = [θ^(1), …, θ^(N)] and z^(N)

4. Illustration

This section illustrates the procedures proposed above to model CCID(1) right-censored time series in the particular case of the Poisson distribution and binomial thinning operation.

4.1. Illustration with CensPoINAR(1)

In this section, the performance of the Bayesian approaches proposed above is illustrated via synthetic data. Thus, realizations with n = 100, 350, 1000 observations of CensPoINAR(1) models were simulated, with parameters θ = (0.2, 3) and θ = (0.5, 5), considering for each case two censoring levels, namely 30% and 5%.
For the ABC estimates, we run N = 10^6 replications and choose the pairs (α, λ) corresponding to the 0.1% lower quantile of d_S^(k), Equation (7), a total of 1000 values, from which the estimates are computed as the mean. The software R [25] was used to implement the ABC algorithm.
To implement the GDA-MMS algorithm, we consider the initial values θ^(0) = (α^(0), λ^(0)) given by the conditional least squares estimates of α and λ [24]. The hyper-parameters for the prior distributions of α and λ are α ~ Beta(a = 2, b = 2) and λ ~ Gamma(c = 0.1, d = 0.1). The function armspp from the package armspp [26] in R was used to sample from the full conditional distributions. Several experiments were carried out to determine the chain length required for stability; thus, the number of Gibbs sampler iterations used in this work is N = 15,000. Of these, the first 5000 simulations were discarded as burn-in and, to reduce autocorrelation between MCMC draws, only the simulations from every 30th iteration were retained. Therefore, we use a simulated sample of size 323 to obtain the Bayesian estimates. A convergence analysis with the usual diagnostic tests was performed with the package coda [27] in R [25].
Table 2 and Table 3 summarize the ABC and GDA-MMS results for the scenarios described above: point estimates α̂ and λ̂, obtained as sample means, and the corresponding bias, standard deviation and coefficient of variation. The results indicate that the bias tends to decrease for larger sample sizes and smaller censoring rates. They also indicate that, overall, ABC produces estimates with smaller bias but larger variability than GDA-MMS.
Additionally, Figure 3 and Figure 4 show the corresponding posterior densities. The plots show unimodal and approximately symmetric distributions, with a dispersion that clearly decreases with increasing sample size and decreasing censoring rate. The posterior densities indicate that the ABC approach produces posteriors that are flatter but with modes very close to the true values, while GDA-MMS, despite producing more concentrated posteriors, also evidences higher bias. However, the behaviour of the GDA-MMS estimates varies with the parameters and even with the sample size. These results are representative of the properties of the GDA-MMS estimates across a large number of experiments, not reported here for conciseness.

4.2. Simulation Study for GDA-MMS

This section presents the results of a simulation study designed to further analyse the sample properties of GDA-MMS, in particular the bias of the resulting Bayesian estimates.
For that purpose, realizations with sample sizes n = 100 and n = 350 of CensPoINAR(1) models with parameters θ = (0.2, 3) and θ = (0.5, 5) are generated, considering two censoring levels, namely 30% and 5%. To analyse the performance of the procedure, the sample posterior mean, standard deviation and mean squared error were calculated over 50 repetitions.
Boxplots of the sample bias over the 50 repetitions of the GDA-MMS methodology are presented in Figure 5 and Figure 6. The bias increases with the censoring rate and the variability decreases with the sample size. Furthermore, in general, the estimates for α present positive sample mean biases, indicating that α is overestimated, whilst the estimates for λ show negative sample biases, indicating underestimation. Both the bias and the dispersion seem larger for λ.
Table 4 and Table 5 present the sample posterior measures for α̂ and λ̂, respectively. The performance of the estimation method improves as the sample size increases. Additionally, the higher the censoring percentage, the worse the behaviour of the proposed method.

5. Final Comments

This work approaches the problem of estimating CCID(1) models for time series of counts under censoring from a Bayesian perspective. Two algorithms are proposed: one based on the ABC methodology and the other on a Gibbs sampler with Data Augmentation modified with multiple sampling. Experiments with synthetic data show that both approaches lead to estimates with less bias than those obtained by neglecting the censoring. Moreover, the GDA-MMS approach yields a complete dataset, making it a valuable method in other situations, such as missing data.
In this study, we focus on the most popular CCID(1) model, the Poisson INAR(1). However, if the data under study present over- or under-dispersion, other CCID(1) models with appropriate distributions for the innovations, such as the generalised Poisson or negative binomial, can easily be entertained. Furthermore, one can consider different models for time series of counts under censoring, based on INGARCH models ([28,29], using a switching mechanism), if they are more suitable for the dataset to be modelled. These issues are beyond the scope of this paper and are a topic for future research.

Author Contributions

The authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

The first and third authors were partially supported by The Center for Research and Development in Mathematics and Applications (CIDMA) through the Portuguese Foundation for Science and Technology (FCT—Fundação para a Ciência e a Tecnologia), reference UIDB/04106/2020. The second author is partially financed by National Funds through the Portuguese funding agency, FCT, within project UIDB/50014/2020.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zeger, S.; Brookmeyer, R. Regression Analysis with Censored Autocorrelated Data. J. Am. Stat. Assoc. 1986, 81, 722–729.
  2. Hopke, P.K.; Liu, C.; Rubin, D.B. Multiple imputation for multivariate data with missing and below-threshold measurements: Time-series concentrations of pollutants in the Arctic. Biometrics 2001, 57, 22–33.
  3. Park, J.W.; Genton, M.G.; Ghosh, S.K. Censored time series analysis with autoregressive moving average models. Can. J. Stat. 2007, 35, 151–168.
  4. Mohammad, N.M. Censored Time Series Analysis. Ph.D. Thesis, The University of Western Ontario, London, ON, Canada, 2014.
  5. Schumacher, F.L.; Lachos, V.H.; Dey, D.K. Censored regression models with autoregressive errors: A likelihood-based perspective. Can. J. Stat. 2017, 45, 375–392.
  6. Wang, C.; Chan, K.S. carx: An R Package to Estimate Censored Autoregressive Time Series with Exogenous Covariates. R J. 2017, 9, 213–231.
  7. Wang, C.; Chan, K.S. Quasi-Likelihood Estimation of a Censored Autoregressive Model with Exogenous Variables. J. Am. Stat. Assoc. 2018, 113, 1135–1145.
  8. Fernández-Fontelo, A.; Cabaña, A.; Puig, P.; Moriña, D. Under-reported data analysis with INAR-hidden Markov chains. Stat. Med. 2016, 35, 4875–4890.
  9. De Oliveira, G.L.; Loschi, R.H.; Assunção, R.M. A random-censoring Poisson model for underreported data. Stat. Med. 2017, 36, 4873–4892.
  10. Joe, H. Time series models with univariate margins in the convolution-closed infinitely divisible class. J. Appl. Probab. 1996, 33, 664–677.
  11. Al-Osh, M.A.; Alzaid, A.A. First-order integer-valued autoregressive (INAR(1)) process. J. Time Ser. Anal. 1987, 8, 261–275.
  12. McKenzie, E. Some ARMA models for dependent sequences of Poisson counts. Adv. Appl. Probab. 1988, 20, 822–835.
  13. Alzaid, A.A.; Al-Osh, M.A. Some autoregressive moving average processes with generalized Poisson marginal distributions. Ann. Inst. Stat. Math. 1993, 45, 223–232.
  14. Han, L.; McCabe, B. Testing for parameter constancy in non-Gaussian time series. J. Time Ser. Anal. 2013, 34, 17–29.
  15. Jung, R.C.; Tremayne, A.R. Useful models for time series of counts or simply wrong ones? AStA Adv. Stat. Anal. 2011, 95, 59–91.
  16. Frazier, D.T.; Martin, G.M.; Robert, C.P.; Rousseau, J. Asymptotic properties of approximate Bayesian computation. Biometrika 2018, 105, 593–607.
  17. Plagnol, V.; Tavaré, S. Approximate Bayesian computation and MCMC. In Monte Carlo and Quasi-Monte Carlo Methods; Niederreiter, H., Ed.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 99–113.
  18. Wilkinson, R.D. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Stat. Appl. Genet. Mol. Biol. 2013, 12, 129–141.
  19. Frazier, D.T.; Maneesoonthorn, W.; Martin, G.M.; McCabe, B. Approximate Bayesian forecasting. Int. J. Forecast. 2019, 35, 521–539.
  20. Biau, G.; Cérou, F.; Guyader, A. New insights into Approximate Bayesian Computation. Ann. Inst. Henri Poincaré Probab. Stat. 2015, 51, 376–403.
  21. Gelfand, A.; Smith, A. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc. 1990, 85, 398–409.
  22. Chib, S. Bayes Inference in the Tobit Censored Regression Model. J. Econom. 1992, 51, 79–99.
  23. Sousa, R.; Pereira, I.; Silva, M.E.; McCabe, B. Censored Regression with Serially Correlated Errors: A Bayesian approach. arXiv 2023, arXiv:2301.01852.
  24. Silva, I.; Silva, M.E.; Pereira, I.; Silva, N. Replicated INAR(1) Processes. Methodol. Comput. Appl. Probab. 2005, 7, 517–542.
  25. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022. Available online: https://www.R-project.org/ (accessed on 20 March 2023).
  26. Bertolacci, M. armspp: Adaptive Rejection Metropolis Sampling (ARMS) via 'Rcpp'. R Package Version 0.0.2. 2019. Available online: https://CRAN.R-project.org/package=armspp (accessed on 20 March 2023).
  27. Plummer, M.; Best, N.; Cowles, K.; Vines, K. CODA: Convergence Diagnosis and Output Analysis for MCMC. R News 2006, 6, 7–11.
  28. Chen, C.W.S.; Lee, S. Generalized Poisson autoregressive models for time series of counts. Comput. Stat. Data Anal. 2016, 99, 51–67.
  29. Ferland, R.; Latour, A.; Oraichi, D. Integer-Valued GARCH Process. J. Time Ser. Anal. 2006, 27, 923–942.
Figure 1. Synthetic dataset with n = 350 observations generated from a PoINAR(1) process with parameters α = 0.5 and λ = 5 (X_t, blue line) and the respective right-censored dataset (Y_t, red line), at L = 11.
Figure 2. Synthetic dataset with n = 350 observations generated from a PoINAR(1) process with parameters α = 0.5 and λ = 5 (X_t, blue line), the respective right-censored dataset (Y_t, red line), at L = 11, and an example of data augmentation (Z_t, black line).
Figure 3. ABC and GDA-MMS posterior densities of the parameters for a realization of 100, 350 and 1000 observations of a CensPoINAR(1) model with θ = (0.2, 3), considering two levels of censoring. Note that the scales of the x-axes of the six plots differ.
Figure 4. ABC and GDA-MMS posterior densities of the parameters for a realization of 100, 350 and 1000 observations of a CensPoINAR(1) model with θ = (0.5, 5), considering two levels of censoring. Note that the scales of the x-axes of the six plots differ.
Figure 5. Boxplots of bias for GDA-MMS estimates of α, when θ = (0.5, 5).
Figure 6. Boxplots of bias for GDA-MMS estimates of λ, when θ = (0.5, 5).
Table 1. Methods for constructing integer-valued AR(1) models with specified marginals F_μ and i.i.d. innovations e_t ∼ F_λ, λ = μ(1 − α). B(·, ·) denotes the beta function and C(·, ·) the binomial coefficient.

| Marginal distribution | Random operator | g(s ∣ X_{t−1}; α) | Innovations | θ |
|---|---|---|---|---|
| Poisson Po(μ) | binomial thinning | C(X_{t−1}, s) α^s (1 − α)^{X_{t−1}−s} | Po(λ) | (μ, α) |
| Negative binomial NB(μ, ξ) | beta-binomial thinning | C(X_{t−1}, s) B(αξ + s, (1 − α)ξ + X_{t−1} − s) / B(αξ, (1 − α)ξ) | NB(λ, ξ) | (μ, α, ξ) |
| Generalised Poisson GP(μ, ξ) | quasi-binomial thinning | C(X_{t−1}, s) α(α + s(ξ/μ))^{s−1} (1 − α − s(ξ/μ))^{X_{t−1}−s} | GP(λ, ξ) | (μ, α, ξ) |
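The random operators in Table 1 are all thinning mechanisms. For instance, the binomial thinning α∘X used in the Poisson row keeps each of the X individuals independently with probability α, so that α∘x follows a Binomial(x, α) distribution. A small sketch (names ours, NumPy assumed) implements the operator from this definition and checks the distribution empirically:

```python
import numpy as np

def binomial_thinning(x, alpha, rng):
    # alpha o x: sum of x independent Bernoulli(alpha) survival indicators
    return int(rng.binomial(1, alpha, size=x).sum())

rng = np.random.default_rng(0)
draws = np.array([binomial_thinning(20, 0.5, rng) for _ in range(20000)])
# Binomial(20, 0.5) has mean 10 and variance 5; the sample moments
# of the simulated draws should be close to these values
print(draws.mean(), draws.var())
```

In practice one would sample α∘x directly as a single binomial draw; the Bernoulli-sum form above merely mirrors the definition of the operator.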
Table 2. ABC and GDA-MMS results for the parameter α (sample mean and the corresponding bias, standard deviation (s.d.) and coefficient of variation (CV)) for synthetic data generated from CensPoINAR(1) models. The percentage of censored observations is given in parentheses next to the censoring level L.

| α | λ | L | n | ABC mean | ABC bias | ABC s.d. | ABC CV | GDA-MMS mean | GDA-MMS bias | GDA-MMS s.d. | GDA-MMS CV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2 | 3 | 4 (30%) | 100 | 0.2571 | 0.0571 | 0.0911 | 0.3544 | 0.3155 | 0.1155 | 0.0323 | 0.1024 |
| 0.2 | 3 | 4 (30%) | 350 | 0.2067 | 0.0067 | 0.0579 | 0.2803 | 0.2274 | 0.0274 | 0.0178 | 0.0783 |
| 0.2 | 3 | 4 (30%) | 1000 | 0.1793 | −0.0207 | 0.0398 | 0.2217 | 0.2025 | 0.0025 | 0.0157 | 0.0775 |
| 0.2 | 3 | 6 (5%) | 100 | 0.2268 | 0.0268 | 0.0760 | 0.3350 | 0.2738 | 0.0738 | 0.0270 | 0.0986 |
| 0.2 | 3 | 6 (5%) | 350 | 0.2302 | 0.0302 | 0.0511 | 0.2221 | 0.2309 | 0.0309 | 0.0140 | 0.0606 |
| 0.2 | 3 | 6 (5%) | 1000 | 0.1931 | −0.0069 | 0.0327 | 0.1692 | 0.1915 | −0.0085 | 0.0112 | 0.0585 |
| 0.5 | 5 | 11 (30%) | 100 | 0.5304 | 0.0304 | 0.0800 | 0.1508 | 0.5596 | 0.0596 | 0.0170 | 0.0304 |
| 0.5 | 5 | 11 (30%) | 350 | 0.4637 | −0.0363 | 0.0535 | 0.1153 | 0.4834 | −0.0166 | 0.0124 | 0.0257 |
| 0.5 | 5 | 11 (30%) | 1000 | 0.5115 | 0.0115 | 0.0320 | 0.0626 | 0.5050 | 0.0050 | 0.0072 | 0.0143 |
| 0.5 | 5 | 14 (5%) | 100 | 0.5230 | 0.0230 | 0.0815 | 0.1559 | 0.5363 | 0.0363 | 0.0175 | 0.0326 |
| 0.5 | 5 | 14 (5%) | 350 | 0.4671 | −0.0329 | 0.0461 | 0.0987 | 0.4796 | −0.0204 | 0.0107 | 0.0223 |
| 0.5 | 5 | 14 (5%) | 1000 | 0.4992 | −0.0008 | 0.0291 | 0.0584 | 0.5008 | 0.0008 | 0.0070 | 0.0140 |
Table 3. ABC and GDA-MMS results for the parameter λ (sample mean and the corresponding bias, standard deviation (s.d.) and coefficient of variation (CV)) for synthetic data generated from CensPoINAR(1) models.

| α | λ | L | n | ABC mean | ABC bias | ABC s.d. | ABC CV | GDA-MMS mean | GDA-MMS bias | GDA-MMS s.d. | GDA-MMS CV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2 | 3 | 4 (30%) | 100 | 2.6623 | −0.3377 | 0.3699 | 0.1389 | 2.3265 | −0.6735 | 0.1144 | 0.0492 |
| 0.2 | 3 | 4 (30%) | 350 | 2.8530 | −0.1470 | 0.2353 | 0.0825 | 2.6639 | −0.3361 | 0.0672 | 0.0252 |
| 0.2 | 3 | 4 (30%) | 1000 | 3.1398 | 0.1398 | 0.1668 | 0.0531 | 2.9757 | −0.0243 | 0.0603 | 0.0203 |
| 0.2 | 3 | 6 (5%) | 100 | 2.7918 | −0.2082 | 0.3203 | 0.1147 | 2.5719 | −0.4281 | 0.1007 | 0.0392 |
| 0.2 | 3 | 6 (5%) | 350 | 2.9507 | −0.0493 | 0.2173 | 0.0736 | 2.7846 | −0.2154 | 0.0579 | 0.0208 |
| 0.2 | 3 | 6 (5%) | 1000 | 3.1342 | 0.1342 | 0.1448 | 0.0462 | 3.0417 | 0.0417 | 0.0460 | 0.0151 |
| 0.5 | 5 | 11 (30%) | 100 | 4.3432 | −0.6568 | 0.7504 | 0.1728 | 3.9528 | −1.0472 | 0.1600 | 0.0405 |
| 0.5 | 5 | 11 (30%) | 350 | 5.2315 | 0.2315 | 0.5265 | 0.1006 | 4.9073 | −0.0927 | 0.1177 | 0.0240 |
| 0.5 | 5 | 11 (30%) | 1000 | 4.9102 | −0.0898 | 0.3247 | 0.0661 | 4.8974 | −0.1026 | 0.0720 | 0.0147 |
| 0.5 | 5 | 14 (5%) | 100 | 4.4488 | −0.5512 | 0.7828 | 0.1760 | 4.2574 | −0.7426 | 0.1682 | 0.0395 |
| 0.5 | 5 | 14 (5%) | 350 | 5.1333 | 0.1333 | 0.4286 | 0.0835 | 4.9877 | −0.0123 | 0.1088 | 0.0218 |
| 0.5 | 5 | 14 (5%) | 1000 | 5.0613 | 0.0613 | 0.2826 | 0.0558 | 4.9964 | −0.0036 | 0.0708 | 0.0142 |
Table 4. Sample posterior mean, standard errors (in brackets) and root mean square error (RMSE) for GDA-MMS estimates of α.

| α | λ | L | n | Mean (s.e.) | RMSE |
|---|---|---|---|---|---|
| 0.2 | 3 | 4 (30%) | 100 | 0.2918 (0.0977) | 0.1341 |
| 0.2 | 3 | 4 (30%) | 350 | 0.2385 (0.0698) | 0.0797 |
| 0.2 | 3 | 6 (5%) | 100 | 0.2739 (0.0680) | 0.1004 |
| 0.2 | 3 | 6 (5%) | 350 | 0.2229 (0.0487) | 0.0538 |
| 0.5 | 5 | 11 (30%) | 100 | 0.5404 (0.0632) | 0.0750 |
| 0.5 | 5 | 11 (30%) | 350 | 0.5156 (0.0344) | 0.0378 |
| 0.5 | 5 | 14 (5%) | 100 | 0.5142 (0.0626) | 0.0642 |
| 0.5 | 5 | 14 (5%) | 350 | 0.5066 (0.0386) | 0.0392 |
Table 5. Sample posterior mean, standard errors (in brackets) and root mean square error (RMSE) for GDA-MMS estimates of λ.

| α | λ | L | n | Mean (s.e.) | RMSE |
|---|---|---|---|---|---|
| 0.2 | 3 | 4 (30%) | 100 | 2.5283 (0.3077) | 0.5632 |
| 0.2 | 3 | 4 (30%) | 350 | 2.7842 (0.2462) | 0.3274 |
| 0.2 | 3 | 6 (5%) | 100 | 2.6814 (0.2984) | 0.4365 |
| 0.2 | 3 | 6 (5%) | 350 | 2.8934 (0.1843) | 0.2129 |
| 0.5 | 5 | 11 (30%) | 100 | 4.4861 (0.6710) | 0.8452 |
| 0.5 | 5 | 11 (30%) | 350 | 4.7976 (0.3357) | 0.3920 |
| 0.5 | 5 | 14 (5%) | 100 | 4.7593 (0.6391) | 0.6829 |
| 0.5 | 5 | 14 (5%) | 350 | 4.9229 (0.4177) | 0.4248 |