*Article* **A New First-Order Integer-Valued Autoregressive Model with Bell Innovations**

**Jie Huang and Fukang Zhu \***

School of Mathematics, Jilin University, 2699 Qianjin Street, Changchun 130012, China; jiehuang19@mails.jlu.edu.cn

**\*** Correspondence: fzhu@jlu.edu.cn

**Abstract:** A Poisson distribution is commonly used as the innovation distribution for integer-valued autoregressive models, but its mean is equal to its variance, which limits flexibility, so a flexible, one-parameter, infinitely divisible Bell distribution may be a good alternative. In addition, for a parameter with a small value, the Bell distribution approaches the Poisson distribution. In this paper, we introduce a new first-order, non-negative, integer-valued autoregressive model with Bell innovations based on the binomial thinning operator. Compared with other models, the new model is not only simple but also particularly suitable for time series of counts exhibiting overdispersion. Some properties of the model are established here, such as the mean, variance, joint distribution functions, and multi-step-ahead conditional measures. Conditional least squares, Yule–Walker, and conditional maximum likelihood are used for estimating the parameters. Some simulation results are presented to assess the performance of these estimates. Real data examples are provided.

**Keywords:** Bell distribution; count time series; estimation; INAR; overdispersion

**Citation:** Huang, J.; Zhu, F. A New First-Order Integer-Valued Autoregressive Model with Bell Innovations. *Entropy* **2021**, *23*, 713. https://doi.org/10.3390/e23060713

Received: 5 April 2021 Accepted: 1 June 2021 Published: 4 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

In recent years, the study of count time series has attracted considerable attention in fields such as finance, medical science, and insurance, and scholars have proposed many models for count data. The most famous model, the first-order integer-valued autoregressive (INAR(1)) process, was introduced by McKenzie (1985) [1] and Al-Osh and Alzaid (1987) [2] based on the binomial thinning operator ◦ (Steutel and van Harn 1979 [3]). Given a non-negative integer-valued random variable (r.v.) $X$ and a constant $\alpha \in (0, 1)$, the binomial thinning operator ◦ is defined as $\alpha \circ X = \sum\_{i=1}^{X} \xi\_i$, where the counting series $\{\xi\_i\}$ is a sequence of independent and identically distributed (i.i.d.) Bernoulli r.v.s with $P(\xi\_i = 1) = 1 - P(\xi\_i = 0) = \alpha$. The INAR(1) model then takes the form

$$X\_t = \alpha \circ X\_{t-1} + \epsilon\_t, \quad t = 0, 1, 2, \dots, \tag{1}$$

where $\{\epsilon\_t\}$ is a sequence of i.i.d. discrete r.v.s with mean $\mu\_\epsilon$ and finite variance $\sigma\_\epsilon^2$, and $\epsilon\_t$ is independent of $\xi\_i$ and $X\_{t-s}$ for $s \ge 1$. According to Alzaid and Al-Osh (1988) [4], the mean and variance of the INAR(1) model are

$$
\mu := \mu\_X = \frac{\mu\_\epsilon}{1 - \alpha} \quad \text{and} \quad \sigma^2 := \sigma\_X^2 = \frac{\sigma\_\epsilon^2 + \alpha\mu\_\epsilon}{1 - \alpha^2}, \quad \text{respectively.}
$$

For the innovations, the Poisson distribution is often assumed as the distribution of $\epsilon\_t$ in the INAR(1) model. A natural characteristic of the Poisson distribution is equidispersion; i.e., its mean and variance are equal. In practice, however, many datasets are overdispersed (the variance is greater than the mean) relative to the Poisson distribution. For this reason, the INAR(1) model with Poisson innovations is not always suitable for modeling integer-valued time series, and several models that describe overdispersion have been discussed in the statistical literature.

One common approach is to change the thinning operation in the INAR(1) model. Weiß (2018) [5] summarized several alternative thinning operators, such as random coefficient thinning, iterated thinning, and quasi-binomial thinning. Ristić et al. (2009) [6] proposed the negative binomial thinning operator and defined the corresponding INAR(1) process with geometric marginal distributions. Liu and Zhu (2021) [7] generalized the binomial thinning operator to the extended binomial one.

Changing the distribution of the innovations is another way to modify the INAR(1) model. Jung et al. (2005) [8] indicated that the INAR(1) model with negative binomial innovations (NB-INAR(1)) is appropriate for generating overdispersion. Jazi et al. (2012) [9] defined a zero-inflated Poisson ZIP(*ρ*, *λ*) distribution for the innovations (ZIP-INAR(1)), because overdispersion frequently arises when zero counts occur more often than expected under the Poisson distribution. Jazi et al. (2012) [10] proposed a modification of the INAR(1) model with geometric innovations (G-INAR(1)) for modeling overdispersed count data. Schweer and Weiß (2014) [11] investigated the compound Poisson INAR(1) (CP-INAR(1)) model, which is suitable for fitting datasets with overdispersion; they also noted that the negative binomial and geometric distributions both belong to the compound Poisson family. Livio et al. (2018) [12] presented the INAR(1) model with Poisson–Lindley innovations (PL-INAR(1)). Bourguignon et al. (2019) [13] introduced the INAR(1) model with double Poisson (DP-INAR(1)) and generalized Poisson innovations (GP-INAR(1)). Qi et al. (2019) [14] considered zero-and-one inflated INAR(1)-type models, and Cunha et al. (2021) [15] introduced an INAR(1) model with Borel innovations to model zero-truncated count time series.

This paper adopts the second approach to deal with overdispersion. Although several models have been proposed in recent years, most of the considered distributions are generalizations of the Poisson distribution with more than one parameter, such as the zero-inflated Poisson, compound Poisson, double Poisson, and generalized Poisson distributions. Here we use a relatively simple distribution introduced by Castellares et al. (2018) [16] for the innovations, the Bell distribution. It has the advantages of having only one parameter, belonging to the exponential family, having a simple probability mass function, and being infinitely divisible. Infinite divisibility is significant for constructing the binomial thinning INAR(1) model. Further, the Bell distribution is suitable for modeling some overdispersed count data. Therefore, we introduce a new INAR(1) model with Bell innovations (BL-INAR(1)), which can account for overdispersion in an INAR(1) framework.

In order to observe whether the BL-INAR(1) model has advantages, we compare it with the INAR(1) model with Poisson innovations (P-INAR(1)), G-INAR(1), PL-INAR(1), NB-INAR(1), ZIP-INAR(1), DP-INAR(1), and GP-INAR(1) models. Different information criteria, such as Akaike's information criterion (AIC) [17], the Bayesian information criterion (BIC) [18], the consistent Akaike information criterion (CAIC) [19], and the Hannan–Quinn information criterion (HQIC) [20], are used to compare the above eight models. By comparing the results of different information criteria, it can be seen that the BL-INAR(1) model is competitive when modeling the overdispersed integer-valued time series data, which shows that the proposed BL-INAR(1) model is meaningful; see Section 5 for more details.

We organize the remaining parts of this paper as follows. In Section 2, we briefly review the Bell distribution, including its definition and some properties; we then propose the BL-INAR(1) model and derive its basic properties, including the conditional mean and variance. Section 3 discusses estimation of the model parameters by the conditional least squares (CLS), Yule–Walker (YW), and conditional maximum likelihood (CML) methods. In Section 4, a numerical simulation of the estimates is presented with some discussion. In Section 5, we compare the proposed model with seven other INAR(1)-type models on two real data examples, which show the superior performance of the proposed model. The paper concludes in Section 6.

#### **2. The BL-INAR(1) Model**

In this section, we present a brief review of the Bell distribution (Castellares et al., 2018 [16]). Its definition and some properties are presented. Later we introduce the BL-INAR(1) model and derive some basic properties of it.

#### *2.1. The Bell Distribution*

At first, we introduce the Bell numbers. Bell (1934) [21] provided the following expansion:

$$\exp(e^x - 1) = \sum\_{n=0}^{\infty} \frac{B\_n}{n!} x^n, \quad x \in \mathbb{R},$$

where *Bn* is the Bell number defined by

$$B\_n = \frac{1}{e} \sum\_{k=0}^{\infty} \frac{k^n}{k!}. \tag{2}$$

The Bell number $B\_n$ is the *n*-th moment of the Poisson distribution with parameter equal to 1. The first few Bell numbers are $B\_0 = B\_1 = 1$, $B\_2 = 2$, $B\_3 = 5$, $B\_4 = 15$, $B\_5 = 52$, $B\_6 = 203$, $B\_7 = 877$, $B\_8 = 4140$, $B\_9 = 21{,}147$, $B\_{10} = 115{,}975$, $B\_{11} = 678{,}570$, $B\_{12} = 4{,}213{,}597$, and $B\_{13} = 27{,}644{,}437$.
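The Bell numbers can also be generated without the infinite series (2) by the standard recurrence $B\_{n+1} = \sum\_{k=0}^{n} \binom{n}{k} B\_k$. A minimal Python sketch (the helper name `bell_numbers` is ours; the paper itself uses R) reproduces the values listed above:

```python
from math import comb

def bell_numbers(n_max):
    """Compute B_0..B_{n_max} via the recurrence B_{n+1} = sum_k C(n, k) * B_k."""
    B = [1]  # B_0 = 1
    for n in range(n_max):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B

B = bell_numbers(13)
# B == [1, 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597, 27644437]
```

Since each $B\_{n+1}$ reuses all previous values, the whole list up to $B\_{n}$ costs $O(n^2)$ additions, which is ample for the small supports needed below.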

For the convenience of the reader, we introduce the following definition and properties of the Bell distribution described in Castellares et al. (2018) [16]:

**Definition 1.** *A discrete r.v. $Z$ taking values in $\mathbb{N}\_0 = \{0, 1, 2, \dots\}$ has a Bell distribution with parameter $\theta > 0$, denoted as $Z \sim \mathrm{Bell}(\theta)$, if its probability mass function is given by*

$$P(Z=z) = \frac{\theta^z e^{-e^{\theta} + 1} B\_z}{z!}, \quad z \in \mathbb{N}\_0, \tag{3}$$

*where $B\_z$ is the Bell number in (2).*

We can see that the Bell distribution has only one parameter, and it belongs to the one-parameter exponential family of distributions. If $Z \sim \mathrm{Bell}(\theta)$, its probability generating function is

$$G\_Z(s) = E\left(s^Z\right) = \exp\left(e^{s\theta} - e^{\theta}\right), \quad |s| < 1.$$

The mean and variance of *Z* are

$$E(Z) = \theta e^{\theta} \quad \text{and} \quad \mathrm{Var}(Z) = \theta(1+\theta)e^{\theta}, \quad \text{respectively.} \tag{4}$$

Note that $\mathrm{Var}(Z)/E(Z) = 1 + \theta > 1$; hence, the Bell distribution is overdispersed, which means it may be suitable for count data with overdispersion in certain situations.
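As a quick numerical check of (3) and (4), the following Python sketch (helper names are ours; the paper's computations were done in R) evaluates the Bell pmf on a truncated support and compares the resulting moments with $E(Z) = \theta e^{\theta}$ and $\mathrm{Var}(Z) = \theta(1+\theta)e^{\theta}$:

```python
from math import comb, exp, factorial

def bell_numbers(n_max):
    """B_0..B_{n_max} via the recurrence B_{n+1} = sum_k C(n, k) * B_k."""
    B = [1]
    for n in range(n_max):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B

def bell_pmf(z, theta, B):
    """P(Z = z) from Equation (3)."""
    return theta ** z * exp(-exp(theta) + 1) * B[z] / factorial(z)

theta = 1.5
B = bell_numbers(60)
p = [bell_pmf(z, theta, B) for z in range(61)]      # truncated support 0..60
mean = sum(z * pz for z, pz in enumerate(p))
var = sum(z * z * pz for z, pz in enumerate(p)) - mean ** 2
# mean ≈ theta * e^theta and var ≈ theta * (1 + theta) * e^theta
```

The truncation point 60 is far into the tail for $\theta = 1.5$ (the mean is about 6.7), so the truncated sums agree with the closed forms to high precision.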

There are some other interesting properties of the Bell distribution, including the following: (i) the Poisson distribution is not nested in the Bell family, but for small values of the parameter, the Bell distribution approaches the Poisson distribution; (ii) it is identifiable, strongly unimodal, and infinitely divisible; (iii) an r.v. $Z \sim \mathrm{Bell}(\theta)$ has the same distribution as $Y\_1 + Y\_2 + \cdots + Y\_N$, where the $Y\_n$ are i.i.d. zero-truncated Poisson r.v.s with parameter $\theta$ and $N \sim \mathrm{Poisson}(e^{\theta} - 1)$. See Castellares et al. (2018) [16] for more properties.
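Property (iii) gives a direct way to sample from Bell(*θ*) without evaluating the pmf. The Python sketch below (function names and sampling choices are ours) draws $N \sim \mathrm{Poisson}(e^{\theta} - 1)$ and adds $N$ zero-truncated Poisson(*θ*) variates, the latter obtained by simple rejection:

```python
import random
from math import exp

def rand_bell(theta, rng):
    """Draw Z ~ Bell(theta) as Y_1 + ... + Y_N with N ~ Poisson(e^theta - 1)
    and the Y_n i.i.d. zero-truncated Poisson(theta), per property (iii)."""
    def rand_poisson(lam):
        # inverse-transform Poisson sampling; adequate for moderate lam
        u, k, p = rng.random(), 0, exp(-lam)
        c = p
        while u > c:
            k += 1
            p *= lam / k
            c += p
        return k
    total = 0
    for _ in range(rand_poisson(exp(theta) - 1.0)):
        y = 0
        while y == 0:          # rejection yields the zero-truncated Poisson
            y = rand_poisson(theta)
        total += y
    return total

rng = random.Random(42)
draws = [rand_bell(1.0, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)   # should be near 1.0 * e^1.0 ≈ 2.718
```

With 20,000 seeded draws the sample mean lands close to $\theta e^{\theta}$, consistent with (4).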

Additionally, there are some papers based on the Bell distribution, and the following are a few related references: Batsidis et al. (2020) [22] proposed and studied a goodness-of-fit test for the Bell distribution, which is consistent against fixed alternatives; Castellares et al. (2020) [23] presented a new two-parameter Bell–Touchard discrete distribution; Lemonte et al. (2020) [24] introduced a zero-inflated Bell regression model for count data; Muhammad et al. (2021) [25] proposed a Bell ridge regression as a solution to multicollinearity problems.

#### *2.2. Definition and Properties of the BL-INAR(1) Process*

In this section, we give the definition of the BL-INAR(1) process, and its basic statistical properties are derived.

**Definition 2.** *Let $\{X\_t\}\_{t\in\mathbb{N}\_0}$ be an INAR(1) process according to (1). It is called a BL-INAR(1) model if the innovations $\{\epsilon\_t\}\_{t\in\mathbb{N}\_0}$ are a sequence of i.i.d. $\mathrm{Bell}(\theta)$ r.v.s given by (3); i.e.,*

$$\begin{cases} X\_t = \alpha \circ X\_{t-1} + \epsilon\_t, & t \ge 1, \\ \epsilon\_t \sim \mathrm{Bell}(\theta), \end{cases} \tag{5}$$

*where $0 < \alpha < 1$ and $\theta > 0$, and $\epsilon\_t$ is independent of $\xi\_i$ and $X\_{t-1}$ for $t \ge 1$.*

According to Equation (4), the mean and variance of $\epsilon\_t$ are finite; therefore, the process $\{X\_t\}\_{t\in\mathbb{N}\_0}$ in (5) is an ergodic stationary Markov chain (Du and Li, 1991 [26]) with transition probabilities

$$\begin{split} P\_{ij} &= P(X\_t = i \mid X\_{t-1} = j) = P(\alpha \circ X\_{t-1} + \epsilon\_t = i \mid X\_{t-1} = j) \\ &= \sum\_{m=0}^{\min(i,j)} P(\alpha \circ X\_{t-1} = m \mid X\_{t-1} = j) P(\epsilon\_t = i - m) \\ &= \sum\_{m=0}^{\min(i,j)} \binom{j}{m} \alpha^m (1-\alpha)^{j-m} \frac{\theta^{i-m} e^{-e^{\theta}+1} B\_{i-m}}{(i-m)!}, \quad i, j = 0, 1, \dots \end{split}$$

Further, we can obtain the joint probability function as follows:

$$\begin{split} f(i\_1, i\_2, \dots, i\_T) &= P(X\_1 = i\_1, X\_2 = i\_2, \dots, X\_T = i\_T) \\ &= P(X\_1 = i\_1) P(X\_2 = i\_2 \mid X\_1 = i\_1) \cdots P(X\_T = i\_T \mid X\_{T-1} = i\_{T-1}) \\ &= P(X\_1 = i\_1) \prod\_{k=1}^{T-1} \left[ \sum\_{m=0}^{\min(i\_k, i\_{k+1})} \binom{i\_k}{m} \alpha^m (1-\alpha)^{i\_k - m} P(\epsilon\_{k+1} = i\_{k+1} - m) \right]. \end{split} \tag{6}$$
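As a sanity check on the transition probabilities $P\_{ij}$, the following Python sketch (helper names are ours) evaluates the finite sum in the last display and verifies that a row of the transition matrix sums to one, up to truncation of the state space:

```python
from math import comb, exp, factorial

def bell_numbers(n_max):
    B = [1]
    for n in range(n_max):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B

def transition_prob(i, j, alpha, theta, B):
    """P(X_t = i | X_{t-1} = j) for the BL-INAR(1) model: binomial thinning
    survivors m, plus a Bell(theta) innovation of size i - m."""
    def innov(k):
        return theta ** k * exp(-exp(theta) + 1) * B[k] / factorial(k)
    return sum(comb(j, m) * alpha ** m * (1 - alpha) ** (j - m) * innov(i - m)
               for m in range(min(i, j) + 1))

alpha, theta = 0.5, 0.8
B = bell_numbers(120)
# each row of the transition matrix sums to 1, up to state-space truncation
row = sum(transition_prob(i, 3, alpha, theta, B) for i in range(120))
```

Note that $P\_{00}$ reduces to the innovation probability $P(\epsilon\_t = 0) = e^{-e^{\theta}+1}$, which the sketch also confirms.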

The conditional mean, conditional variance, mean, variance, covariance and autocorrelation function of the BL-INAR(1) process are given in the following lemma.

**Lemma 1.** *Let $X\_t$ be the process in Definition 2. Then it has the following properties: (i) $E[X\_t \mid X\_{t-1}] = \alpha X\_{t-1} + \mu\_\epsilon = \alpha X\_{t-1} + \theta e^{\theta}$; (ii) $\mathrm{Var}[X\_t \mid X\_{t-1}] = \alpha(1-\alpha)X\_{t-1} + \sigma\_\epsilon^2 = \alpha(1-\alpha)X\_{t-1} + \theta(1+\theta)e^{\theta}$; (iii) $\mu := E[X\_t] = \frac{\theta e^{\theta}}{1-\alpha}$; (iv) $\sigma^2 := \mathrm{Var}[X\_t] = \frac{\theta e^{\theta}(1+\alpha+\theta)}{1-\alpha^2}$; (v) $\gamma\_k := \mathrm{Cov}(X\_t, X\_{t+k}) = \alpha^k \sigma^2$; (vi) $\rho\_k := \mathrm{Corr}(X\_t, X\_{t+k}) = \alpha^k$.*

The proof of Lemma 1 is similar to that of Theorem 1 of Qi et al. (2019) [14], so it is omitted.

According to Lemma 1, the dispersion index (Fisher, 1950 [27]) of *Xt* is derived as follows:

$$I\_x := \frac{\sigma^2}{\mu} = 1 + \frac{\theta}{1+\alpha} > 1;$$

thus, the BL-INAR(1) process is suited for overdispersed integer-valued time series.

Additionally, we can obtain the *k*-step ahead conditional mean and *k*-step ahead conditional variance of the BL-INAR(1) process in the following theorem.

**Theorem 1.** *The k-step ahead conditional mean and k-step ahead conditional variance of the BL-INAR(1) process are given, respectively, by:*

$$E(X\_{t+k} \mid X\_t) = \alpha^k X\_t + \mu\_\epsilon \frac{1-\alpha^k}{1-\alpha},$$

*and*

$$\mathrm{Var}(X\_{t+k} \mid X\_t) = \alpha^k\left(1 - \alpha^k\right) X\_t + \mu\_\epsilon \frac{(\alpha - \alpha^k)(1 - \alpha^k)}{1 - \alpha^2} + \sigma\_\epsilon^2 \frac{1 - \alpha^{2k}}{1 - \alpha^2}.$$

For more details about the proof of this theorem, see Qi et al. (2019) [14] and Ristić, Bakouch, and Nastić (2009) [6]. It is easy to see that as $k \to \infty$, $E(X\_{t+k} \mid X\_t) \to \frac{\mu\_\epsilon}{1-\alpha} = \frac{\theta e^{\theta}}{1-\alpha}$ and $\mathrm{Var}(X\_{t+k} \mid X\_t) \to \frac{\alpha\mu\_\epsilon + \sigma\_\epsilon^2}{1-\alpha^2} = \frac{\theta e^{\theta}(1+\alpha+\theta)}{1-\alpha^2}$, which are the unconditional mean and variance of $X\_t$, respectively.
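The closed forms in Theorem 1 can be checked numerically against the one-step recursions $m\_k = \alpha m\_{k-1} + \mu\_\epsilon$ and $v\_k = \alpha^2 v\_{k-1} + \alpha(1-\alpha)m\_{k-1} + \sigma\_\epsilon^2$ (with $m\_0 = X\_t$, $v\_0 = 0$), which follow from Lemma 1 and the law of total variance. A small Python sketch (function names are ours):

```python
from math import exp

def kstep_mean(x_t, k, alpha, theta):
    """k-step-ahead conditional mean of the BL-INAR(1) process (Theorem 1)."""
    mu_eps = theta * exp(theta)
    return alpha ** k * x_t + mu_eps * (1 - alpha ** k) / (1 - alpha)

def kstep_var(x_t, k, alpha, theta):
    """k-step-ahead conditional variance of the BL-INAR(1) process (Theorem 1)."""
    mu_eps = theta * exp(theta)
    sig2_eps = theta * (1 + theta) * exp(theta)
    return (alpha ** k * (1 - alpha ** k) * x_t
            + mu_eps * (alpha - alpha ** k) * (1 - alpha ** k) / (1 - alpha ** 2)
            + sig2_eps * (1 - alpha ** (2 * k)) / (1 - alpha ** 2))

alpha, theta, x0 = 0.5, 1.5, 4
# for large k, both quantities approach the unconditional moments of X_t
mu = theta * exp(theta) / (1 - alpha)
sig2 = theta * exp(theta) * (1 + alpha + theta) / (1 - alpha ** 2)
```

Iterating the recursion and comparing with the closed forms gives agreement to machine precision, and for large *k* both quantities converge to the unconditional moments, as stated above.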

#### **3. Estimation of Parameters**

The true values of the parameters *α* and *θ* are unknown in practice; therefore, we need to estimate (*α*, *θ*). Sometimes an estimate of (*α*, *μ*) is obtained first and then converted into an estimate of (*α*, *θ*). In this section, we consider three estimation methods, namely, CLS, YW, and CML.

#### *3.1. Conditional Least Squares Estimation*

The CLS estimates of the parameters *α* and *θ* are obtained by

$$(\hat{\alpha}, \hat{\theta}) = \arg\min \sum\_{t=2}^{T} [X\_t - E(X\_t \mid X\_{t-1})]^2,$$

and the CLS estimates of (*α*, *μ*) are given by

$$\hat{\alpha}\_{\mathrm{CLS}} = \frac{(T-1)\sum\_{t=2}^{T} X\_t X\_{t-1} - \sum\_{t=2}^{T} X\_t \sum\_{t=2}^{T} X\_{t-1}}{(T-1)\sum\_{t=2}^{T} X\_{t-1}^2 - \left(\sum\_{t=2}^{T} X\_{t-1}\right)^2},$$

and

$$
\hat{\mu}\_{\mathrm{CLS}} = \frac{\sum\_{t=2}^{T} X\_t - \hat{\alpha}\_{\mathrm{CLS}} \sum\_{t=2}^{T} X\_{t-1}}{(T-1)(1 - \hat{\alpha}\_{\mathrm{CLS}})}.
$$

Then, the CLS estimate of $\theta$ can be obtained by solving the equation $\hat{\theta}\_{\mathrm{CLS}} e^{\hat{\theta}\_{\mathrm{CLS}}} = \hat{\mu}\_{\mathrm{CLS}}(1 - \hat{\alpha}\_{\mathrm{CLS}})$.
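The CLS recipe above can be sketched in a few lines of Python (the function name is ours; the scalar equation $\theta e^{\theta} = \hat{\mu}(1-\hat{\alpha})$ is solved here by Newton's method, assuming the right-hand side is positive, as it is for a stationary count series with positive mean and $|\hat{\alpha}| < 1$):

```python
from math import exp, log

def cls_estimates(x):
    """CLS estimates (alpha_hat, mu_hat) for an INAR(1) series x_1..x_T,
    then theta_hat solving theta * e^theta = mu_hat * (1 - alpha_hat)."""
    T = len(x)
    xt, xl = x[1:], x[:-1]                      # X_t and X_{t-1}, t = 2..T
    n = T - 1
    s_t, s_l = sum(xt), sum(xl)
    a_hat = (n * sum(a * b for a, b in zip(xt, xl)) - s_t * s_l) / \
            (n * sum(b * b for b in xl) - s_l ** 2)
    mu_hat = (s_t - a_hat * s_l) / (n * (1 - a_hat))
    c = mu_hat * (1 - a_hat)                    # target value of theta * e^theta
    th = log(1 + max(c, 1e-8))                  # Newton starting point; assumes c > 0
    for _ in range(50):
        th = th - (th * exp(th) - c) / ((1 + th) * exp(th))
    return a_hat, mu_hat, th
```

Newton's method converges quickly here because $g(\theta) = \theta e^{\theta}$ is increasing and convex on $\theta > 0$.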

According to Theorems 3.1 and 3.2 in Tjøstheim (1986) [28], we can establish the consistency and asymptotic normality of the CLS estimates $\hat{\alpha}\_{\mathrm{CLS}}$ and $\hat{\mu}\_{\mathrm{CLS}}$ in the following theorem. The proofs of Theorem 2 and the following theorem are given in Appendix A.

**Theorem 2.** *Let $\hat{\alpha}\_{\mathrm{CLS}}$ and $\hat{\mu}\_{\mathrm{CLS}}$ be the CLS estimates of the BL-INAR(1) process; then $(\hat{\alpha}\_{\mathrm{CLS}}, \hat{\mu}\_{\mathrm{CLS}})$ is strongly consistent for $(\alpha, \mu)$, and its asymptotic distribution is given by:*

$$\sqrt{T}(\hat{\alpha}\_{\mathrm{CLS}} - \alpha, \hat{\mu}\_{\mathrm{CLS}} - \mu)' \stackrel{d}{\rightarrow} N(0, \Sigma),$$

*where*

$$\Sigma = \begin{bmatrix} \frac{\alpha(1+\alpha)\mu + \sigma\_\epsilon^2}{(1-\alpha)^2} & \frac{\alpha\sigma^2}{\mu(1+\mu)} \\ \frac{\alpha\sigma^2}{\mu(1+\mu)} & \frac{\alpha(1-\alpha)(\mu\_3 - 2\mu\sigma^2 - \mu^3) + \sigma\_\epsilon^2\sigma^2}{\mu^2(1+\mu)^2} \end{bmatrix},$$

*and* $\mu\_3 = E(X\_t^3) = \frac{(1-\alpha^3)\mu^3 + (1+2\alpha^2-3\alpha^3)\mu\sigma^2 + \alpha^2(1-\alpha)\sigma^2}{1-\alpha^3}$.

Using the delta method, we can obtain the limiting distribution of $(\hat{\alpha}, \hat{\theta})$, from which it also follows that $\hat{\theta}$ is consistent.

#### *3.2. Yule–Walker Estimation*

Let $X\_1, \dots, X\_T$ come from the process $\{X\_t\}$ in Definition 2. The sample mean is $\bar{X} = \frac{1}{T} \sum\_{t=1}^{T} X\_t$, and the sample autocorrelation function is

$$
\hat{\rho}\_k = \frac{\sum\_{t=1}^{T-k} (X\_t - \bar{X})(X\_{t+k} - \bar{X})}{\sum\_{t=1}^{T} (X\_t - \bar{X})^2}.
$$

From Lemma 1, we know $\rho\_k = \alpha^k$; thus, the Yule–Walker (YW) estimate of $\alpha$ is given by

$$\hat{\alpha}\_{\mathrm{YW}} = \hat{\rho}\_1 = \frac{\sum\_{t=1}^{T-1} (X\_t - \bar{X})(X\_{t+1} - \bar{X})}{\sum\_{t=1}^{T} (X\_t - \bar{X})^2},$$

and

$$
\hat{\mu}\_{\mathrm{YW}} = \bar{X};
$$

since $\mu = \frac{\theta e^{\theta}}{1-\alpha}$, the estimate of $\theta$ can then be obtained by solving $\hat{\theta}\_{\mathrm{YW}} e^{\hat{\theta}\_{\mathrm{YW}}} = \hat{\mu}\_{\mathrm{YW}}(1 - \hat{\alpha}\_{\mathrm{YW}})$.
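The YW steps can be sketched analogously (the function name is ours; as with CLS, the $\theta$-equation is solved by Newton's method under the assumption $\hat{\mu}\_{\mathrm{YW}}(1-\hat{\alpha}\_{\mathrm{YW}}) > 0$):

```python
from math import exp, log

def yw_estimates(x):
    """YW estimates: alpha_hat = lag-1 sample autocorrelation, mu_hat = sample
    mean, and theta_hat solving theta * e^theta = mu_hat * (1 - alpha_hat)."""
    T = len(x)
    xbar = sum(x) / T
    a_hat = (sum((x[t] - xbar) * (x[t + 1] - xbar) for t in range(T - 1))
             / sum((v - xbar) ** 2 for v in x))
    mu_hat = xbar
    c = mu_hat * (1 - a_hat)
    th = log(1 + max(c, 1e-8))                  # Newton's method, assuming c > 0
    for _ in range(50):
        th = th - (th * exp(th) - c) / ((1 + th) * exp(th))
    return a_hat, mu_hat, th
```

On the same input series, the YW and CLS estimates are typically close, in line with Theorem 3 below on their asymptotic equivalence.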

For the asymptotic properties of the YW estimates, Freeland and McCabe (2005) [29] showed that the YW and CLS estimates are asymptotically equivalent for a Poisson INAR(1) process. The next theorem shows that this conclusion also holds for our BL-INAR(1) process.

**Theorem 3.** *In the BL-INAR(1) process, CLS and YW estimates are asymptotically equivalent, i.e.,*

$$
\hat{\alpha}\_{\mathrm{CLS}} - \hat{\alpha}\_{\mathrm{YW}} = o\_p(T^{-\frac{1}{2}}) \quad \text{and} \quad \hat{\mu}\_{\mathrm{CLS}} - \hat{\mu}\_{\mathrm{YW}} = o\_p(T^{-\frac{1}{2}}).
$$

*3.3. Conditional Maximum Likelihood Estimation*

According to the joint probability function (6), the likelihood function can be obtained as:

$$\begin{split} f(x\_1, x\_2, \dots, x\_T) &= P(X\_1 = x\_1) \prod\_{t=1}^{T-1} P(X\_{t+1} = x\_{t+1} \mid X\_t = x\_t) \\ &= f(x\_1) \prod\_{t=1}^{T-1} \left[ \sum\_{m=0}^{\min(x\_t, x\_{t+1})} \binom{x\_t}{m} \alpha^m (1-\alpha)^{x\_t - m} P(\epsilon\_{t+1} = x\_{t+1} - m) \right]. \end{split}$$

Conditioning on the variable $X\_1$, we obtain the conditional log-likelihood function

$$L(\alpha, \theta) = \sum\_{t=1}^{T-1} \log P(X\_{t+1} = x\_{t+1} \mid X\_t = x\_t);$$

The CML estimates of $(\alpha, \theta)$ are the values $(\hat{\alpha}\_{\mathrm{CML}}, \hat{\theta}\_{\mathrm{CML}})$ obtained by maximizing the conditional log-likelihood function $L(\alpha, \theta)$. It is easy to check that the BL-INAR(1) process satisfies conditions (C1)–(C6) of Franke and Seligmann (1993) [30]; thus, the CML estimates are consistent and asymptotically normal. The proof is similar to those of Theorems 22.4 and 22.5 of Franke and Seligmann (1993) [30], so it is omitted.
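The conditional log-likelihood $L(\alpha, \theta)$ is straightforward to code from the transition probabilities. The Python sketch below (function names are ours) maximizes it by a crude grid search purely for illustration; the paper's actual estimates use the BFGS quasi-Newton algorithm with numerical derivatives instead:

```python
from math import comb, exp, factorial, log

def bell_numbers(n_max):
    B = [1]
    for n in range(n_max):
        B.append(sum(comb(n, k) * B[k] for k in range(n + 1)))
    return B

def cond_loglik(x, alpha, theta, B):
    """Conditional log-likelihood L(alpha, theta) of the BL-INAR(1) model, given X_1."""
    def innov(k):
        return theta ** k * exp(-exp(theta) + 1) * B[k] / factorial(k)
    def trans(i, j):
        return sum(comb(j, m) * alpha ** m * (1 - alpha) ** (j - m) * innov(i - m)
                   for m in range(min(i, j) + 1))
    return sum(log(trans(x[t + 1], x[t])) for t in range(len(x) - 1))

def cml_grid(x, B, steps=20):
    """Crude grid search for the maximizer (a real implementation would use BFGS)."""
    best = None
    for ia in range(1, steps):
        for it in range(1, steps):
            a, th = ia / steps, 2.0 * it / steps
            ll = cond_loglik(x, a, th, B)
            if best is None or ll > best[0]:
                best = (ll, a, th)
    return best

x = [1, 2, 1, 0, 2, 3, 1, 1, 2, 0, 1, 2]
B = bell_numbers(max(x) + 1)
ll, a_hat, th_hat = cml_grid(x, B)
```

Only Bell numbers up to the largest observed count are needed, since the innovation size in each transition term never exceeds $x\_{t+1}$.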

#### **4. Simulation**

A Monte Carlo simulation was conducted to study the performance of the CLS, YW, and CML estimates of the BL-INAR(1) model. The CML estimates were obtained using the BFGS quasi-Newton nonlinear optimization algorithm with numerical derivatives, with the YW estimates as initial values. The simulation was conducted in the R programming language; the sample size was 100, 250, 500, or 1000, and the number of replicates was 1000. For the true parameter values, we considered *α* = 0.25, 0.5, and 0.75 and *θ* = 0.5 and 1.5.
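The simulation design above can be sketched in a few lines. The paper used R, but the following Python version (function names and the initialization choice are ours) generates a BL-INAR(1) path by combining binomial thinning with the compound-Poisson representation of Bell(*θ*):

```python
import random
from math import exp

def rand_poisson(lam, rng):
    # inverse-transform Poisson sampling; adequate for moderate lam
    u, k, p = rng.random(), 0, exp(-lam)
    c = p
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

def rand_bell(theta, rng):
    # Bell(theta) = sum of N i.i.d. zero-truncated Poisson(theta), N ~ Poisson(e^theta - 1)
    total = 0
    for _ in range(rand_poisson(exp(theta) - 1.0, rng)):
        y = 0
        while y == 0:
            y = rand_poisson(theta, rng)
        total += y
    return total

def simulate_blinar1(T, alpha, theta, seed=1):
    """Simulate X_1..X_T from model (5); X_1 is crudely drawn from the innovation law."""
    rng = random.Random(seed)
    x = [rand_bell(theta, rng)]
    for _ in range(T - 1):
        thinned = sum(rng.random() < alpha for _ in range(x[-1]))   # binomial thinning
        x.append(thinned + rand_bell(theta, rng))
    return x

path = simulate_blinar1(5000, alpha=0.5, theta=1.5)
# stationary mean is theta * e^theta / (1 - alpha) ≈ 13.4
```

Each step applies the thinning $\alpha \circ X\_{t-1}$ as a sum of Bernoulli trials and adds a fresh Bell innovation; the sample mean of a long seeded path lands near the stationary mean from Lemma 1(iii).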

First, we give the Q–Q plots of the CLS, YW, and CML estimates for the BL-INAR(1) model with sample size *T* = 1000, *α* = 0.5, and *θ* = 1.5 in Figure 1. The points in the six Q–Q plots fall along roughly straight lines; i.e., the estimates of the parameters are approximately normally distributed. The numerical simulation results are presented in Tables 1 and 2. Comparing the two tables, we find that for the same *θ* and *T*, the mean squared error (MSE) for the estimate of *θ* increased with *α*, while the MSE for the estimate of *α* decreased. Additionally, for the same *α* and *T*, the MSE for the estimate of *θ* increased with *θ*, while the MSE for the estimate of *α* decreased. Furthermore, the CLS and YW estimates are similar, and the bias of all estimates tended toward zero as the sample size increased. The CML estimates converged faster to the true parameter values and produced the smallest MSEs; we conclude that CML performed better than CLS and YW.

**Table 1.** Empirical means and mean squared errors (in parentheses) of the estimates of the parameters for some values of *α* and *θ* of the BL-INAR(1) model.



**Table 2.** Empirical means and mean squared errors (in parentheses) of the estimates of the parameters for some values of *α* and *θ* of the BL-INAR(1) model.

**Figure 1.** The Q–Q plots of the CLS, YW, and CML estimates for the BL-INAR(1) model with sample size *T* = 1000.

#### **5. Real Data Examples**

In this section, we present two applications of the BL-INAR(1) model to real datasets, and compare it with the P-INAR(1), G-INAR(1), PL-INAR(1), NB-INAR(1), ZIP-INAR(1), DP-INAR(1), and GP-INAR(1) models. Results of the comparison are discussed here as well.

#### *5.1. Disconduct Data*

The first dataset is a monthly count of disconduct in the first census tract in Rochester, which is available online at http://www.forecastingprinciples.com (accessed on 8 May 2012). The data comprise 132 observations (*T* = 132), starting in January 1991 and ending in December 2001.

The time plot, histogram, autocorrelation function (ACF), and partial autocorrelation function (PACF) are provided in Figure 2. We applied the Ljung–Box test (Ljung and Box (1978) [31]) to check whether this time series has any autocorrelation. The *p*-value of the Ljung–Box test is $1.317 \times 10^{-5}$, which is less than 0.05. This means that the time series data have some autocorrelation; moreover, according to the PACF plot, the data are first-order autocorrelated, which shows that an AR(1)-type process is appropriate for modeling this dataset.

The sample mean and variance of the data are $\bar{X} = 1.6288$ and $S\_X^2 = 2.4455$, respectively. Thus, the dispersion index is $\hat{I}\_x = S\_X^2/\bar{X} = 1.5014$. According to the overdispersion test of Schweer and Weiß (2014) [11], the critical value for these data is 1.1994. The dispersion index $\hat{I}\_x$ exceeds the critical value, which means that the equidispersed P-INAR(1) model is not a good choice for the data.

**Figure 2.** The time plot, histogram, ACF, and PACF of disconduct data.

For comparison, we calculated the CML estimates of the parameters, and the AIC, BIC, CAIC, HQIC, fitted mean, and fitted variance of the BL-INAR(1), P-INAR(1), G-INAR(1), PL-INAR(1), ZIP-INAR(1), NB-INAR(1), DP-INAR(1), and GP-INAR(1) models. Among the eight models, the first four are two-parameter models and the last four are three-parameter models. The results are presented in Table 3. We found that the AIC, BIC, CAIC, and HQIC of the BL-INAR(1) model were smaller than those of the others. We also found that the fitted means of all eight models were close to the sample mean, with that of the PL-INAR(1) model the closest. In terms of fitted variance, Table 3 shows that the BL-INAR(1) model performed better than the other seven models.


**Table 3.** CML estimates, AIC, BIC, CAIC, HQIC, fitted mean, fitted variance, and RMSE for eight INAR(1) models of disconduct data.

<sup>1</sup> Bold text means the smallest value in the column. <sup>2</sup> Bold text means that this value is the closest in the column to the sample value described in the text.

For prediction, we used the first 126 observations to estimate the parameters and then predicted the last six observations. The predicted values of the disconduct data are given by $E(X\_{t+k} \mid X\_t) = \alpha^k X\_t + \mu\_\epsilon \frac{1-\alpha^k}{1-\alpha}$. For a further comparison of the models, we calculated the root mean square error of prediction (RMSE) for the last 6 months of the data, defined as $\mathrm{RMSE} = \sqrt{\frac{1}{6}\sum\_{k=1}^{6}(X\_{t+k} - \hat{X}\_{t+k})^2}$. The RMSE results for the eight models are presented in the last column of Table 3. From the table, we can see that the RMSE of the G-INAR(1) model was the best. The RMSE of the BL-INAR(1) model is smaller than those of the P-INAR(1), NB-INAR(1), DP-INAR(1), and GP-INAR(1) models, and a little larger than those of the G-INAR(1), PL-INAR(1), and ZIP-INAR(1) models. Although the fitted mean and RMSE of the BL-INAR(1) model are not the best, it is the best choice under the other five criteria. Further, we analyzed the Pearson residuals; Figure 3 plots the ACF, PACF, and Q–Q plots of the residuals. The ACF and PACF graphs show no correlation between the residuals, which is supported by the Ljung–Box test with a *p*-value of 0.05251 > 0.05. The Q–Q plot appears roughly normal, as expected. Hence, we conclude that the BL-INAR(1) model is the most suitable of those considered for this dataset.
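The forecasting and RMSE computation just described can be sketched as follows (function names are ours, and the inputs are illustrative, not the disconduct data):

```python
from math import exp, sqrt

def forecast(x_t, k, alpha, theta):
    """k-step-ahead conditional mean forecast E(X_{t+k} | X_t) of the BL-INAR(1) model."""
    mu_eps = theta * exp(theta)
    return alpha ** k * x_t + mu_eps * (1 - alpha ** k) / (1 - alpha)

def rmse(actual, x_t, alpha, theta):
    """Root mean square error of the k = 1..len(actual) step forecasts made at time t."""
    errs = [(actual[k - 1] - forecast(x_t, k, alpha, theta)) ** 2
            for k in range(1, len(actual) + 1)]
    return sqrt(sum(errs) / len(errs))
```

By construction, feeding the forecasts themselves back in as the "actual" values gives an RMSE of zero, which is a convenient self-check.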

**Figure 3.** The ACF, PACF, and Q–Q plots of the Pearson residual for disconduct data using the BL-INAR(1) model.

#### *5.2. Strikes Data*

The second dataset, which was analyzed by Weiß (2010) [32], is the monthly number of work stoppages (strikes and lock-outs) involving 1000 or more workers in the period 1994–2002. It was published by the US Bureau of Labor Statistics and can be obtained online at http://www.bls.gov/wsp/ (accessed on 8 May 2012). The data contain 108 observations, and the time plot, histogram, ACF, and PACF are provided in Figure 4. As with the previous example, the Ljung–Box test was used to check whether the strike data have any autocorrelation. The *p*-value of the Ljung–Box test was $2.372 \times 10^{-8}$, which shows that the time series data have some autocorrelation; according to the PACF plot, the series is also first-order autocorrelated, so an AR(1)-type process is appropriate for modeling this dataset.

The sample mean, variance, and dispersion index were calculated to be 4.9444, 7.8488, and 1.5874, respectively. According to the overdispersion test, the critical value for these data is 1.2808, so it is inappropriate to use the P-INAR(1) model to fit the data. The CML estimates, AIC, BIC, CAIC, HQIC, fitted mean, and fitted variance for the BL-INAR(1), P-INAR(1), G-INAR(1), PL-INAR(1), NB-INAR(1), ZIP-INAR(1), DP-INAR(1), and GP-INAR(1) models were obtained and are shown in Table 4. We see that the AIC, BIC, CAIC, and HQIC of the BL-INAR(1) model are smaller than those of the others, and its fitted mean is not much different from those of the other seven models. Further, the BL-INAR(1) model performed better than the others in terms of fitted variance. As in the previous example, the first 102 observations were used to estimate the parameters and predict the last six observations. The RMSE of the predictions is also presented in Table 4. The RMSE of the G-INAR(1) model is the smallest; however, it is only 0.05 less than that of the BL-INAR(1) model. As in the previous example, although the BL-INAR(1) model was not the best under the fitted mean and RMSE criteria, it performed best under the other five criteria. Additionally, we examined the Pearson residuals. Figure 5 gives the ACF, PACF, and Q–Q plots of the residuals. We found no evidence of any significant correlation within the residuals, a finding also supported by the Ljung–Box test with a *p*-value of 0.9522 > 0.05. The Q–Q plot also appears roughly normal. Thus, according to the above discussion and its simplicity, we conclude that the BL-INAR(1) model is the most appropriate.

**Figure 4.** The time plot, histogram, ACF, and PACF of data on strikes.

**Table 4.** CML estimates, AIC, BIC, CAIC, HQIC, fitted mean, fitted variance, and RMSE from eight INAR(1) models of strike data.


<sup>1</sup> Bold means the smallest value in the column. <sup>2</sup> Bold means that this value is the closest in the column to the sample value described in the text.

Combining the above two examples with the advantages of the Bell distribution, namely its single parameter and simple form, we conclude that the BL-INAR(1) model is competitive with the other seven models.

**Figure 5.** The ACF, PACF, and Q–Q plots of the Pearson residual for strike data using the BL-INAR(1) model.

#### **6. Conclusions**

A new INAR(1) model with Bell innovations based on the binomial thinning operator was introduced in this paper. Based on the overdispersion of the Bell distribution, we found that the BL-INAR(1) model is suitable for overdispersed data. Some basic properties of the model were obtained, such as the transition probabilities, conditional mean, conditional variance, mean, variance, covariance, autocorrelation function, and *k*-step-ahead conditional mean and variance. The CLS, YW, and CML methods were used to estimate the unknown parameters. The Q–Q plots showed that the estimates of the parameters are approximately normally distributed, and the simulation results revealed that the CML estimates were better than the CLS and YW estimates. Finally, by comparing the AIC, BIC, CAIC, and HQIC values, fitted means, fitted variances, and prediction RMSEs among eight INAR(1) models, both real datasets showed that the BL-INAR(1) model fits better than the other INAR(1) models. The residual analysis also showed that the BL-INAR(1) model provided adequate fits to those datasets.

Although there are many overdispersed INAR(1) models, several attractive properties of the Bell distribution make the BL-INAR(1) model competitive: it has a single parameter, is infinitely divisible, has a simple probability mass function, belongs to the one-parameter exponential family of distributions, and approaches the Poisson distribution for small parameter values. Extensions of the Bell distribution, such as the zero-inflated Bell distribution and the Bell–Touchard distribution, suggest related INAR models for future study.
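As a concrete check on these properties, the Bell probability mass function, $P(X = n) = \theta^n e^{1 - e^{\theta}} B\_n / n!$ with $B\_n$ the Bell numbers, can be evaluated directly. The sketch below is our own illustration (not code from the paper): it computes $B\_n$ via the Bell triangle and exhibits the overdispersion $\mathrm{Var}(X)/E(X) = 1 + \theta > 1$, which shrinks toward the Poisson ratio of 1 as $\theta \to 0$:

```python
import math

def bell_numbers(n_max):
    """First n_max + 1 Bell numbers B_0..B_{n_max} via the Bell triangle."""
    row = [1]
    bells = [1]
    for _ in range(n_max):
        new_row = [row[-1]]          # each row starts with the previous row's last entry
        for v in row:
            new_row.append(new_row[-1] + v)
        row = new_row
        bells.append(row[0])
    return bells

def bell_pmf(n, theta):
    """P(X = n) for X ~ Bell(theta): theta^n * exp(1 - e^theta) * B_n / n!."""
    b = bell_numbers(n)[n]
    return theta**n * math.exp(1.0 - math.exp(theta)) * b / math.factorial(n)

theta = 0.5
probs = [bell_pmf(n, theta) for n in range(40)]
mean = sum(n * p for n, p in enumerate(probs))
var = sum(n * n * p for n, p in enumerate(probs)) - mean**2
# mean ~= theta * e^theta and var / mean ~= 1 + theta (overdispersion)
```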

**Author Contributions:** Conceptualization, F.Z.; methodology, F.Z.; software, J.H.; validation, J.H. and F.Z.; formal analysis, J.H.; investigation, J.H. and F.Z.; resources, F.Z.; data curation, J.H.; writing original draft preparation, J.H.; writing—review and editing, F.Z.; visualization, J.H.; supervision, F.Z.; project administration, F.Z.; funding acquisition, F.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** Zhu's work was supported by the National Natural Science Foundation of China, grant numbers 11871027 and 11731015.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The disconduct data and the strike data are available at http://www.forecastingprinciples.com (accessed on 1 June 2021) and http://www.bls.gov/wsp/ (accessed on 1 June 2021), respectively.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

#### *Appendix A.1. Proof of Theorem 2*

To prove this theorem, we need to show that the conditions given in Theorems 3.1 and 3.2 of Tjøstheim (1986) [28] are satisfied.

Define $\phi = (\alpha, \mu)$, and let the true value of the unknown parameter be $\phi\_0 = (\alpha\_0, \mu\_0)$. According to Lemma 1, we know that $E[X\_t^2] < \infty$ and that $E[X\_t \mid X\_{t-1}]$ is almost surely three times differentiable in an open set $\Phi$ containing $\phi\_0$.

Condition 1:

$$E\left\{ \left| \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \phi\_i}(\phi\_0) \right|^2 \right\} < \infty \ \text{ and } \ E\left\{ \left| \frac{\partial^2 E(X\_t \mid X\_{t-1})}{\partial \phi\_i \partial \phi\_j}(\phi\_0) \right|^2 \right\} < \infty, \text{ for } i, j = 1, 2.$$

According to $E(X\_t \mid X\_{t-1}) = \alpha X\_{t-1} + (1 - \alpha)\mu$, we have

$$\begin{aligned} E\left\{ \left| \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \alpha}(\phi\_0) \right|^2 \right\} &= E\left\{ |X\_{t-1} - \mu\_0|^2 \right\} = \mathrm{Var}(X\_{t-1}) < \infty \ \text{ and } \\ E\left\{ \left| \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \mu}(\phi\_0) \right|^2 \right\} &= E\left\{ |1 - \alpha\_0|^2 \right\} = (1 - \alpha\_0)^2 < \infty. \end{aligned}$$

For the second derivatives of $E(X\_t \mid X\_{t-1})$, we have

$$\begin{aligned} &E\left\{ \left| \frac{\partial^2 E(X\_t \mid X\_{t-1})}{\partial \alpha^2}(\phi\_0) \right|^2 \right\} = E\left\{ \left| \frac{\partial^2 E(X\_t \mid X\_{t-1})}{\partial \mu^2}(\phi\_0) \right|^2 \right\} = 0 < \infty \ \text{ and } \\ &E\left\{ \left| \frac{\partial^2 E(X\_t \mid X\_{t-1})}{\partial \alpha \partial \mu}(\phi\_0) \right|^2 \right\} = 1 < \infty. \end{aligned}$$

Condition 2:

The vectors $\partial E(X\_t \mid X\_{t-1})(\phi\_0)/\partial \phi\_i$, $i = 1, 2$, are linearly independent in the sense that if $a\_1$ and $a\_2$ are arbitrary real numbers such that

$$E\left\{ \left| \sum\_{i=1}^{2} a\_i \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \phi\_i} (\phi\_0) \right|^2 \right\} = 0,$$

then $a\_1 = a\_2 = 0$. Note that

$$\begin{split} E\left\{ \left| a\_1 \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \alpha}(\phi\_0) + a\_2 \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \mu}(\phi\_0) \right|^2 \right\} = 0 &\Rightarrow \\ E\left\{ \left| a\_1 (X\_{t-1} - \mu\_0) + a\_2 (1 - \alpha\_0) \right|^2 \right\} = 0 &\Rightarrow \\ \underbrace{a\_1^2 \mathrm{Var}(X\_{t-1})}\_{\geq 0} + \underbrace{a\_2^2 (1 - \alpha\_0)^2}\_{\geq 0} = 0 &\Rightarrow \\ a\_1^2 \underbrace{\mathrm{Var}(X\_{t-1})}\_{> 0} = 0 \ \text{ and } \ a\_2^2 \underbrace{(1 - \alpha\_0)^2}\_{> 0} = 0 &\Rightarrow \\ a\_1^2 = 0 \ \text{ and } \ a\_2^2 = 0. \end{split}$$

Then $a\_1 = a\_2 = 0$.

Condition 3: For $\phi \in \Phi$, there exist functions $G\_{t-1}^{ijk}(X\_1, \ldots, X\_{t-1})$ and $H\_t^{ijk}(X\_1, \ldots, X\_t)$ for $i, j, k = 1, 2$ such that

$$\begin{split} M\_{t-1}^{ijk}(\phi) &= \left| \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \phi\_i}(\phi) \frac{\partial^2 E(X\_t \mid X\_{t-1})}{\partial \phi\_j \partial \phi\_k}(\phi) \right| \leq G\_{t-1}^{ijk}, \ E\left(G\_{t-1}^{ijk}\right) < \infty, \\ N\_t^{ijk}(\phi) &= \left| \{X\_t - E(X\_t \mid X\_{t-1})(\phi)\} \frac{\partial^3 E(X\_t \mid X\_{t-1})}{\partial \phi\_i \partial \phi\_j \partial \phi\_k}(\phi) \right| \leq H\_t^{ijk}, \ E\left(H\_t^{ijk}\right) < \infty. \end{split}$$

Note that $M\_{t-1}^{111}(\phi) = M\_{t-1}^{122}(\phi) = M\_{t-1}^{211}(\phi) = M\_{t-1}^{222}(\phi) = 0$ and

$$\begin{aligned} M\_{t-1}^{112}(\phi) &= M\_{t-1}^{121}(\phi) = |X\_{t-1} - \mu|, \\ M\_{t-1}^{212}(\phi) &= M\_{t-1}^{221}(\phi) = |\alpha - 1| < 1; \end{aligned}$$

then we can choose *Gijk <sup>t</sup>*−1(*φ*) = (*Xt*−<sup>1</sup> <sup>−</sup> *<sup>μ</sup>*) <sup>2</sup> <sup>+</sup> 1, <sup>∀</sup>*i*, *<sup>j</sup>*, *<sup>k</sup>* <sup>=</sup> 1, 2, which guarantees that *Mijk <sup>t</sup>*−1(*φ*) <sup>&</sup>lt; *<sup>G</sup>ijk <sup>t</sup>*−1(*φ*) and *<sup>E</sup>*(*Gijk <sup>t</sup>*−1) = Var(*Xt*−<sup>1</sup> <sup>+</sup> <sup>1</sup>) <sup>&</sup>lt; <sup>∞</sup>.

For *Nijk <sup>t</sup>* (*φ*), it is easy to know that *<sup>N</sup>ijk <sup>t</sup>* (*φ*) = 0, <sup>∀</sup>*i*, *<sup>j</sup>*, *<sup>k</sup>* <sup>=</sup> 1, 2. So we choose *<sup>H</sup>ijk <sup>t</sup>* (*φ*) = 0, <sup>∀</sup>*i*, *<sup>j</sup>*, *<sup>k</sup>* <sup>=</sup> 1, 2 to satisfy *<sup>N</sup>ijk <sup>t</sup>* (*φ*) <sup>&</sup>lt; *<sup>H</sup>ijk <sup>t</sup>* (*φ*) and *<sup>E</sup>*(*Hijk <sup>t</sup>* ) = 0 < ∞.

The above three conditions ensure that $(\hat{\alpha}\_{CLS}, \hat{\mu}\_{CLS})$ is a strongly consistent estimator of $(\alpha, \mu)$. According to Theorem 3.2 in Tjøstheim (1986) [28], the asymptotic distribution of $(\hat{\alpha}\_{CLS}, \hat{\mu}\_{CLS})$ is

$$\sqrt{T}(\hat{\alpha}\_{CLS} - \alpha, \ \hat{\mu}\_{CLS} - \mu)' \xrightarrow{d} N(0, \Sigma),$$

where $\Sigma = U^{-1} R U^{-1}$,

$$\begin{aligned} U &= E\left\{ \frac{\partial E(X\_t \mid X\_{t-1})'}{\partial \phi}(\phi) \cdot \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \phi}(\phi) \right\}, \\ R &= E\left\{ \frac{\partial E(X\_t \mid X\_{t-1})'}{\partial \phi}(\phi) \, f\_{t|t-1}(\phi) \, \frac{\partial E(X\_t \mid X\_{t-1})}{\partial \phi}(\phi) \right\}, \end{aligned}$$

and

$$f\_{t|t-1}(\phi) = E\left\{ (X\_t - E(X\_t \mid X\_{t-1}))(X\_t - E(X\_t \mid X\_{t-1}))' \mid X\_{t-1} \right\}.$$

We can then find that

$$
\Sigma = \begin{bmatrix}
\frac{\alpha(1+\alpha)\mu + \sigma\_{\epsilon}^2}{(1-\alpha)^2} & \frac{\alpha\sigma^2}{\mu(1+\mu)} \\
\frac{\alpha\sigma^2}{\mu(1+\mu)} & \frac{\alpha(1-\alpha)(\mu\_3 - 2\mu\sigma^2 - \mu^3) + \sigma\_{\epsilon}^2\sigma^2}{\mu^2(1+\mu)^2}
\end{bmatrix},
$$

where $\mu\_3 = E(X\_t^3) = \frac{(1 - \alpha^3)\mu^3 + (1 + 2\alpha^2 - 3\alpha^3)\mu\sigma^2 + \alpha^2(1 - \alpha)\sigma^2}{1 - \alpha^3}$, which follows from the derivation below:

$$\begin{aligned} \mu\_3 &= E[X\_t^3] = E[X\_t^2(\alpha \odot X\_{t-1} + \epsilon\_t)] = E[E[X\_t^2(\alpha \odot X\_{t-1} + \epsilon\_t) \mid X\_{t-1}]] \\ &= E[\alpha X\_{t-1} E[X\_t^2 \mid X\_{t-1}] + \mu\_{\epsilon} E[X\_t^2 \mid X\_{t-1}]] \\ &= E[\alpha X\_{t-1} E[X\_t^2 \mid X\_{t-1}]] + \mu\_{\epsilon} E[X\_t^2], \end{aligned}$$

and, according to Lemma 1,

$$\begin{aligned} E[X\_t^2 \mid X\_{t-1}] &= \mathrm{Var}[X\_t \mid X\_{t-1}] + (E[X\_t \mid X\_{t-1}])^2 \\ &= \alpha^2 X\_{t-1}^2 + (\alpha(1 - \alpha) + 2\alpha\mu\_{\epsilon}) X\_{t-1} + \mu\_{\epsilon}^2 + \sigma\_{\epsilon}^2; \end{aligned}$$

then, we have

$$\begin{split} \mu\_3 &= E[\alpha X\_{t-1} E[X\_t^2 \mid X\_{t-1}]] + \mu\_{\epsilon} E[X\_t^2] \\ &= \alpha^3 E[X\_{t-1}^3] + \alpha^2(1-\alpha) E[X\_{t-1}^2] + 2\alpha^2 \mu\_{\epsilon} E[X\_{t-1}^2] + \alpha\mu(\mu\_{\epsilon}^2 + \sigma\_{\epsilon}^2) + \mu\_{\epsilon} E[X\_t^2] \\ &\quad (\text{where } \mu\_{\epsilon} = (1-\alpha)\mu \text{ and } \sigma\_{\epsilon}^2 = (1-\alpha^2)\sigma^2 - \alpha(1-\alpha)\mu) \\ &= \alpha^3 \mu\_3 + (1-\alpha^3)\mu^3 + (1+2\alpha^2-3\alpha^3)\mu\sigma^2 + \alpha^2(1-\alpha)\sigma^2. \end{split}$$

Thus, we obtain that

$$
\mu\_3 = \frac{(1 - \alpha^3)\mu^3 + (1 + 2\alpha^2 - 3\alpha^3)\mu\sigma^2 + \alpha^2(1 - \alpha)\sigma^2}{1 - \alpha^3}.
$$
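For reference, the closed forms for $\mu\_3$ and $\Sigma$ above can be coded directly. The helper below is our own illustration (the argument names are hypothetical); it simply transcribes the two displayed formulas:

```python
import numpy as np

def third_moment(alpha, mu, sigma2):
    """mu_3 = E[X_t^3] from the closed form above (alpha: thinning
    parameter, mu: stationary mean, sigma2: stationary variance)."""
    num = ((1 - alpha**3) * mu**3
           + (1 + 2 * alpha**2 - 3 * alpha**3) * mu * sigma2
           + alpha**2 * (1 - alpha) * sigma2)
    return num / (1 - alpha**3)

def asymptotic_cov(alpha, mu, sigma2, sigma_eps2):
    """CLS asymptotic covariance matrix Sigma, transcribed entry by
    entry from the displayed expression (sigma_eps2: innovation
    variance)."""
    mu3 = third_moment(alpha, mu, sigma2)
    s11 = (alpha * (1 + alpha) * mu + sigma_eps2) / (1 - alpha)**2
    s12 = alpha * sigma2 / (mu * (1 + mu))
    s22 = (alpha * (1 - alpha) * (mu3 - 2 * mu * sigma2 - mu**3)
           + sigma_eps2 * sigma2) / (mu * (1 + mu))**2
    return np.array([[s11, s12], [s12, s22]])
```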

#### *Appendix A.2. Proof of Theorem 3*

The proof is similar to that of Theorem 4.2 in Cunha et al. (2021) [15]. For the estimator $\hat{\alpha}$, we have

$$\begin{split} \sqrt{T}(\hat{\alpha}\_{YW} - \hat{\alpha}\_{CLS}) &= \frac{D\_{CLS} - D\_{YW}}{D\_{CLS} D\_{YW}} \cdot \frac{\sum\_{t=2}^{T} X\_t X\_{t-1}}{\sqrt{T}} \\ &\quad - \frac{D\_{CLS} \sum\_{t=2}^{T} \frac{X\_t}{T} - D\_{YW} \sum\_{t=2}^{T} \frac{X\_t}{T-1}}{D\_{CLS} D\_{YW}} \cdot \frac{\sum\_{t=1}^{T} X\_t}{\sqrt{T}} \\ &\quad - \frac{D\_{YW}}{D\_{CLS} D\_{YW}} \cdot \sum\_{t=2}^{T} \frac{X\_t}{T-1} \cdot \frac{X\_T}{\sqrt{T}} + \frac{D\_{CLS}}{D\_{CLS} D\_{YW}} \cdot \frac{\bar{X}(X\_T - \bar{X})}{\sqrt{T}} \\ &= o\_p(1) O\_p(1) - o\_p(1) O\_p(1) - O\_p(1) O\_p(1) o\_p(1) + O\_p(1) O\_p(1) o\_p(1) \\ &= o\_p(1), \end{split}$$

where $D\_{CLS} = \frac{1}{T}\sum\_{t=2}^{T} X\_{t-1}^2 - \left(\frac{1}{T-1}\sum\_{t=2}^{T} X\_{t-1}\right)^2$ and $D\_{YW} = \frac{1}{T}\sum\_{t=1}^{T}(X\_t - \bar{X})^2 = \frac{1}{T}\sum\_{t=1}^{T} X\_t^2 - \bar{X}^2$. For the estimator $\hat{\mu}\_{\epsilon}$, we only need to prove that $\sqrt{T-1}(\hat{\mu}\_{\epsilon,CLS} - \hat{\mu}\_{\epsilon,YW})$ is $o\_p(1)$:

$$\begin{split} &\sqrt{T-1}(\hat{\mu}\_{\epsilon,CLS} - \hat{\mu}\_{\epsilon,YW}) \\ &= \sqrt{T-1}\left(\frac{\sum\_{t=2}^{T} X\_t - \hat{\alpha}\_{CLS}\sum\_{t=2}^{T} X\_{t-1}}{T-1} - \bar{X}(1 - \hat{\alpha}\_{YW})\right) \\ &= \frac{1}{\sqrt{T-1}}\left(\sum\_{t=2}^{T} X\_t - \hat{\alpha}\_{CLS}\sum\_{t=2}^{T} X\_{t-1} - T\bar{X}(1 - \hat{\alpha}\_{YW}) + \bar{X}(1 - \hat{\alpha}\_{YW})\right) \\ &= \frac{\hat{\alpha}\_{CLS} X\_T - X\_1}{\sqrt{T-1}} - \sqrt{\frac{T}{T-1}} \cdot \sqrt{T}(\hat{\alpha}\_{CLS} - \hat{\alpha}\_{YW})\bar{X} + \frac{\bar{X}}{\sqrt{T-1}}(1 - \hat{\alpha}\_{YW}) \\ &= o\_p(1) - o\_p(1) + o\_p(1) = o\_p(1). \end{split}$$
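The asymptotic equivalence of the CLS and YW estimators of $\alpha$ can also be seen numerically. The sketch below is our own illustration, not the authors' code: Poisson innovations are used only because they are easy to sample, and the equivalence argument does not depend on the innovation law.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_inar1(alpha, lam, T):
    """Simulate an INAR(1) path X_t = alpha o X_{t-1} + eps_t with
    binomial thinning and Poisson(lam) innovations."""
    x = np.empty(T, dtype=int)
    x[0] = rng.poisson(lam / (1 - alpha))  # start near the stationary mean
    for t in range(1, T):
        x[t] = rng.binomial(x[t - 1], alpha) + rng.poisson(lam)
    return x

def alpha_cls(x):
    """Slope of the least-squares regression of X_t on X_{t-1}
    (equivalent to the CLS estimator up to edge terms)."""
    y, z = x[1:], x[:-1]
    return ((y * z).mean() - y.mean() * z.mean()) / z.var()

def alpha_yw(x):
    """Yule-Walker estimator of alpha: lag-1 sample autocorrelation."""
    xc = x - x.mean()
    return (xc[1:] * xc[:-1]).sum() / (xc * xc).sum()

x = simulate_inar1(alpha=0.4, lam=1.0, T=5000)
a_cls, a_yw = alpha_cls(x), alpha_yw(x)
# for large T the two estimates are close to each other (their gap is
# o_p(1)) and both are close to the true alpha = 0.4
```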

#### **References**

