*Article* **Monitoring the Zero-Inflated Time Series Model of Counts with Random Coefficient**

**Cong Li <sup>1</sup>, Shuai Cui <sup>1</sup> and Dehui Wang <sup>1,2,</sup>\***


**Abstract:** In this research, we consider monitoring mean and correlation changes in zero-inflated autocorrelated count data based on an integer-valued time series model with a random survival rate. A cumulative sum control chart is constructed because of its efficiency, and the corresponding methods for calculating the average run length and the standard deviation of the run length are given. Practical guidelines concerning the chart design are investigated. Extensive computations based on designs of experiments are conducted to illustrate the validity of the proposed method. Comparisons with the conventional control charting procedure are also provided. An analysis of the monthly number of drug crimes in the city of Pittsburgh illustrates the proposed method of process monitoring.

**Keywords:** CUSUM control chart; INAR-type time series; statistical process monitoring; random survival rate; zero-inflation

**Citation:** Li, C.; Cui, S.; Wang, D. Monitoring the Zero-Inflated Time Series Model of Counts with Random Coefficient. *Entropy* **2021**, *23*, 372. https://doi.org/10.3390/e23030372

Academic Editor: Christian H. Weiss

Received: 19 January 2021 Accepted: 17 March 2021 Published: 20 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

This work is motivated by an empirical analysis and process control of a monthly drug crime series, which contains excess zeros (over 40%) and shows clear serial dependence (see Section 5 for more details). To address this problem, an appropriate integer-valued model is selected, and control charts based on this model are developed. Among integer-valued models, the class of first-order integer-valued autoregressive (INAR(1)) models plays a very important role and has been widely studied in the literature. Serial dependence among count data has been shown to arise extensively in practice; typical examples are infectious disease counts, defect counts and unemployment counts. Such data are important indicators in epidemic studies, quality control and economic analysis, and process monitoring is essential to detect shifts in them.

The first INAR(1) model, proposed by Al-Osh and Alzaid [1], takes the form

$$X\_t = \alpha \circ X\_{t-1} + \varepsilon\_t, \quad t = 1, 2, \cdots,$$

where the binomial thinning operator "◦", defined by Steutel and Van Harn [2], is given by $\alpha \circ X\_{t-1} = \sum\_{i=1}^{X\_{t-1}} Y\_i$, *α* ∈ (0, 1); {*Y<sub>i</sub>*} is a sequence of independent and identically distributed (i.i.d.) Bernoulli(*α*) random variables; and {*ε<sub>t</sub>*} is a sequence of i.i.d. random variables, independent of all {*Y<sub>i</sub>*}. The INAR(1) model is applied to various kinds of real-world problems because of its good interpretability. As one example, we let *X<sub>t</sub>* represent the number of patients with an infectious disease in a community at time *t*, *ε<sub>t</sub>* the number of new patients entering the community at time *t*, and suppose that each patient at time *t* − 1 survives to time *t* with survival probability *α*. As for the crime data, *α* ◦ *X*<sub>*t*−1</sub> can be considered the number of re-offendings provoked by *X*<sub>*t*−1</sub> with probability *α*. Depending on the nature of the observed data, INAR(1) models have been modified and generalized with respect to their orders (Ristić and Nastić [3], Nastić, Laketa and Ristić [4]), dimensions (Pedeli and Karlis [5], Khan, Cekim and Ozel [6]), marginal distributions (Alzaid and Al-Osh [7], Alzaid and Al-Osh [8], Jazi, Jones and Lai [9], Ristić, Nastić and Bakouch [10], Barreto-Souza [11]), thinning operators (Ristić, Bakouch and Nastić [12], Liu and Zhu [13]), and mixed models (Ristić and Nastić [3], Li, Wang and Zhang [14], Orozco, Sales, Fernández and Pinho [15]). For more literature, we refer to the review papers of Weiß [16] and Scotto, Weiß and Gouveia [17]. Differing from models based on a fixed survival rate *α*, Zheng, Basawa and Datta [18] proposed the random coefficient INAR(1) model, supposing that the parameter *α* may be affected by various environmental factors and could vary randomly over time. Generalizations of random coefficient INAR models can also be found in Kang and Lee [19] and Zhang, Wang and Zhu [20]. In particular, considering both a random survival probability and the zero-inflation phenomenon, Bakouch, Mohammadpour and Shirozhan [21] proposed a zero-inflated geometric INAR(1) time series with random coefficient (the ZIGINARRC(1) process for short). The ZIGINARRC(1) model has a simple structure and good properties, and it turns out to be the best fit for the real data studied here.
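To make the thinning mechanism concrete, here is a minimal simulation sketch in Python (numpy assumed; the Poisson innovations and the parameter values are illustrative, not tied to the data analysed later):

```python
import numpy as np

def simulate_inar1(alpha, lam, T, seed=0):
    """Simulate an INAR(1) path X_t = alpha o X_{t-1} + eps_t with
    Poisson(lam) innovations; 'o' denotes binomial thinning."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T, dtype=int)
    for t in range(1, T):
        survivors = rng.binomial(x[t - 1], alpha)  # alpha o X_{t-1}
        x[t] = survivors + rng.poisson(lam)        # plus new arrivals
    return x

x = simulate_inar1(alpha=0.5, lam=1.0, T=100_000)
print(x.mean(), np.corrcoef(x[:-1], x[1:])[0, 1])
```

With these choices the stationary mean is *λ*/(1 − *α*) = 2 and the lag-one autocorrelation is *α* = 0.5, which the empirical estimates approach for long paths.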

Since serial dependence strongly influences the performance of a control chart, traditional control charts designed under the assumption of independent observations are not appropriate in many cases. Therefore, the monitoring of INAR(1) models has received much attention. The related research includes, but is not limited to, control charts for Poisson INAR(1) models (Weiß [22], Weiß and Testik [23], Weiß and Testik [24], Yontay, Weiß, Testik and Bayindir [25]), for zero-inflated or zero-deflated INAR(1) models (Rakitzis, Weiß and Castagliola [26], Li, Wang and Sun [27], Fernandes, Bourguignon and Ho [28]), and for the mixed INAR(1) model (Sales, Pinho, Vivacqua and Ho [29]). However, to the best of our knowledge, methods for monitoring the zero-inflated INAR(1) model with random coefficient have not been studied in the literature so far, which is exactly what we explore here. As cumulative sum (CUSUM) control charts are known to be sensitive in detecting small shifts, we study the performance of the CUSUM chart for monitoring the ZIGINARRC(1) process. We investigate practical guidelines for the statistical design and methods for evaluating the chart performance. Besides monitoring mean shifts of the ZIGINARRC(1) model, our scope also includes monitoring correlation shifts in the model. Meanwhile, we compare the performance of the CUSUM chart with that of the conventional Shewhart chart.

The rest of the article is outlined as follows. The ZIGINARRC(1) process and some of its properties are introduced in Section 2. In Section 3, we present the monitoring procedure to detect mean and correlation shifts of the process. Extensive computation results are discussed in Section 4. In Section 5, the applicability of the process monitoring is investigated using the monthly number of drug crimes in Pittsburgh. Finally, conclusions and possible future lines of research are given in Section 6.

#### **2. The ZIGINARRC(1) Process**

A randomized binomial thinning operation in Bakouch, Mohammadpour and Shirozhan [21] is defined by

$$\alpha\_t \circ X = \begin{cases} \ \alpha \circ X, & \text{with probability } 1 - \beta, \\ \ 0, & \text{with probability } \beta, \end{cases}$$

where *α*, *β* ∈ (0, 1), and *α<sub>t</sub>* is a binary random variable independent of the discrete random variable *X*, with *P*(*α<sub>t</sub>* = 0) = *β* = 1 − *P*(*α<sub>t</sub>* = *α*).

Based on the definition of the randomized binomial thinning operation, the ZIGINARRC(1) model {*Xt*} presented by Bakouch, Mohammadpour and Shirozhan [21], is given by

$$X\_t = \alpha\_t \circ X\_{t-1} + \varepsilon\_t, \quad t = 1, 2, \cdots,$$

where the marginal distribution is a zero-inflated Geometric distribution (denoted ZIG(*p*, *θ*)) with

$$P(X\_t = 0) = p + \frac{1-p}{1+\theta}, \qquad P(X\_t = j) = \frac{(1-p)\theta^j}{(1+\theta)^{j+1}}, \quad j = 1, 2, \cdots,$$

and *p* ∈ (0, 1), *θ* > 0. The innovation sequence {*ε<sub>t</sub>*} is independent of the past of the solution {*X<sub>s</sub>*; *s* < *t*} and of the binary sequence {*α<sub>t</sub>*}; the parameters are also constrained by the condition *p*/(*β* + *p*(1 − *β*)) < *α* < 1.

The ZIGINARRC(1) process is quite suitable for modelling real-life phenomena in which counted events may survive or vanish with a random survival probability *α<sub>t</sub>*. Such a series is studied in Section 5 with the example of drug crime counts, where the re-offending rate may be affected by the public security and financial situation. The mean, variance, and first-order autocorrelation function of the process are, respectively,

$$\mu\_X \triangleq \mathbb{E}(X\_t) = (1 - p)\theta, \qquad \sigma\_X^2 \triangleq \text{Var}(X\_t) = (1 - p)\theta[(1 + p)\theta + 1],$$

$$\rho\_X \triangleq \text{Corr}(X\_t, X\_{t+1}) = \alpha(1 - \beta).$$

Obviously the process is characterized by overdispersion, i.e., the variance is greater than the expectation. Figure 1 shows some sample paths of simulated ZIGINARRC(1) processes for *θ* = 1, 3, 5; *p* = 0.2, 0.5; *α* = 0.5, 0.8 and *β* = 0.3, 0.8. As we can see, the model has a larger process mean for larger *θ*, and a larger percentage of zeros for larger *p*.

**Figure 1.** Sample paths of ZIGINARRC(1) processes for *θ* = 1, 3, 5, *p* = 0.2, 0.5, *α* = 0.5, 0.8 and *β* = 0.3, 0.8.

Following Theorem 2.1 in Bakouch, Mohammadpour and Shirozhan [21], the ZIGINARRC(1) model has a unique, strictly stationary solution given by

$$X\_t = \varepsilon\_t + \sum\_{i=1}^{\infty} \left( \prod\_{l=0}^{i-1} \alpha\_{t-l} \right) \circ \varepsilon\_{t-i}.$$

Furthermore, the probability mass function of {*ε<sub>t</sub>*} is

$$\begin{split} P(\varepsilon\_{t} = j) &= \frac{p}{\beta + p(1 - \beta)} I\_{(j)} + \frac{(1 - p)(1 - \alpha)}{1 - \alpha[\beta + p(1 - \beta)]} \frac{\theta^{j}}{(1 + \theta)^{j + 1}} \\ &\quad + \frac{(1 - p)(1 - \beta)[\alpha(\beta + p(1 - \beta)) - p]}{(1 - \alpha[\beta + p(1 - \beta)])(\beta + p(1 - \beta))} \times \frac{[\alpha\theta(\beta + p(1 - \beta))]^{j}}{[1 + \alpha\theta(\beta + p(1 - \beta))]^{j + 1}}, \qquad j = 0, 1, 2, \cdots, \end{split}$$

where *I*(*j*) = 1 for *j* = 0 and 0 otherwise. It follows that the innovation series {*ε<sub>t</sub>*} is a mixture of three random variables: a distribution degenerate at 0, a Geometric(*θ*/(1 + *θ*)) distribution and a Geometric(*αθ*(*β* + *p*(1 − *β*))/(1 + *αθ*(*β* + *p*(1 − *β*)))) distribution, with three different mixing proportions. The transition probabilities of the process {*X<sub>t</sub>*} are

$$P(X\_t = j \,|\, X\_{t-1} = i) = \beta P(\varepsilon\_t = j) + (1 - \beta) \sum\_{l=0}^{\min(i, j)} \binom{i}{l} \alpha^l (1 - \alpha)^{i-l} P(\varepsilon\_t = j - l), \quad i, j = 0, 1, \cdots.$$

Some other important probabilistic properties of the process, such as the spectral density, the multi-step conditional mean and variance, extreme order statistics, and distributional properties of the length of runs of zeros, have also been discussed in Bakouch, Mohammadpour and Shirozhan [21]. Furthermore, the unknown parameters of the model can be estimated by conditional least squares or maximum likelihood methods.
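To illustrate the construction, the following sketch (Python with numpy; the parameter values are illustrative and satisfy the constraint *p*/(*β* + *p*(1 − *β*)) < *α* < 1) draws innovations from the three-component mixture above, applies the randomized binomial thinning, and lets the empirical moments be compared with the closed-form mean (1 − *p*)*θ*, variance (1 − *p*)*θ*[(1 + *p*)*θ* + 1] and lag-one correlation *α*(1 − *β*):

```python
import numpy as np

def simulate_ziginar_rc(theta, p, alpha, beta, T, seed=1):
    """Simulate the ZIGINARRC(1) process X_t = alpha_t o X_{t-1} + eps_t."""
    rng = np.random.default_rng(seed)
    s = beta + p * (1 - beta)
    assert p / s < alpha < 1, "parameter constraint of the model"
    # mixture weights of the innovation distribution (see the pmf above)
    w1 = p / s
    w2 = (1 - p) * (1 - alpha) / (1 - alpha * s)
    th2 = alpha * theta * s      # parameter of the second geometric component
    x = np.zeros(T, dtype=int)
    for t in range(1, T):
        u = rng.random()
        if u < w1:                        # point mass at zero
            eps = 0
        elif u < w1 + w2:                 # Geometric with mean theta on 0,1,...
            eps = rng.geometric(1 / (1 + theta)) - 1
        else:                             # Geometric with mean th2 on 0,1,...
            eps = rng.geometric(1 / (1 + th2)) - 1
        # randomized binomial thinning: 0 w.p. beta, else Binomial(X, alpha)
        thin = 0 if rng.random() < beta else rng.binomial(x[t - 1], alpha)
        x[t] = thin + eps
    return x

theta, p, alpha, beta = 2.0, 0.2, 0.5, 0.5
x = simulate_ziginar_rc(theta, p, alpha, beta, T=100_000)
# theoretical values: mean 1.6, variance 5.44, lag-one correlation 0.25
print(x.mean(), x.var(), np.corrcoef(x[:-1], x[1:])[0, 1])
```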

#### **3. Monitoring Procedure**

In this section, we present a CUSUM chart for monitoring the ZIGINARRC(1) process. As this process is used to fit the number of crimes, an increase in the process mean usually indicates a deteriorating public security environment, and an increase in the process correlation usually indicates more re-offendings. Thus, our purpose is to detect increases in both the mean and the correlation of the ZIGINARRC(1) process. According to the model properties, the process mean is affected by the parameters *θ* and *p*, and the correlation is affected by the parameters *α* and *β*. Let *θ*<sub>0</sub>, *p*<sub>0</sub>, *α*<sub>0</sub> and *β*<sub>0</sub> (*θ*<sub>1</sub>, *p*<sub>1</sub>, *α*<sub>1</sub> and *β*<sub>1</sub>) denote the in-control (out-of-control) parameters of the process, and *μ*<sub>0</sub>, *σ*<sub>0</sub>, *ρ*<sub>0</sub> (*μ*<sub>1</sub>, *σ*<sub>1</sub>, *ρ*<sub>1</sub>) the corresponding in-control (out-of-control) process mean, standard deviation and first-order correlation.

CUSUM charts are commonly used in statistical process control and were first proposed by Page [30]. The essential assumption underlying the design of CUSUM charts is that the process observations are independent (Montgomery [31], Alencar, Ho and Albarracin [32], Bourguignon, Medeiros, Fernandes and Ho [33]), and violation of this major assumption seriously affects the monitoring performance of the charts (Harris and Ross [34], Triantafyllopoulos and Bersimis [35], Albarracin, Alencar and Ho [36]). Some authors have studied the performance of CUSUM charts for integer-valued models (Weiß and Testik [23], Weiß and Testik [24], Yontay, Weiß, Testik and Bayindir [25], Rakitzis, Weiß and Castagliola [26], Li, Wang and Sun [27], Lee and Kim [37], Lee, Kim and Kim [38]).

**Scheme (The ZIGINARRC(1) CUSUM chart).** *Let* {*X<sub>t</sub>*} *be a stationary ZIGINARRC(1) process. The CUSUM statistic C<sub>t</sub> is defined as:*

$$C\_t = \max(0, X\_t - k + C\_{t-1}), \quad t = 1, 2, \cdots,$$

*where k is a positive integer constant representing the reference value (k* ≥ *μ*<sub>0</sub>*). This chart is said to be out-of-control when C<sub>t</sub> falls outside the control limit h (h* ∈ **N***), that is, when C<sub>t</sub>* > *h.*

The initial value of the CUSUM statistic is set equal to an integer constant *c*<sub>0</sub>, i.e., *C*<sub>0</sub> = *c*<sub>0</sub> with *c*<sub>0</sub> < *h*. The performance of this chart is evaluated using the average run length (ARL), defined as the average number of points plotted on the chart until the first out-of-control signal is triggered. As {(*X<sub>t</sub>*, *C<sub>t</sub>*)} of the ZIGINARRC(1) process is a bivariate Markov chain, the Markov chain approach proposed by Brook and Evans [39] is adopted to evaluate the exact ARLs. Although this method has been described in detail in the relevant literature by Weiß [22], Weiß and Testik [23] and Weiß and Testik [24], we briefly introduce it here for completeness. The reachable control region (CR) of {(*X<sub>t</sub>*, *C<sub>t</sub>*)} is given by

$$\begin{aligned} \mathcal{CR} & \triangleq \{ (n, i) \in \mathbb{N}\_0 \times \{ 0, \dots, h \} \mid \max(0, n - k + i) \in \{ 0, \dots, h \} \} \\ & = \{ (n, i) \mid i \in \{ 0, \dots, h \}, \; n \in \{ \max(0, i + k - h), \dots, i + k \} \}. \end{aligned}$$

Obviously CR has a finite number of elements and can be ordered in a certain manner. The transition probability matrix of {(*X<sub>t</sub>*, *C<sub>t</sub>*)} is **Q** ≜ (*p*(*n*, *j*|*m*, *i*))<sub>(*n*,*j*),(*m*,*i*)∈CR</sub>, where

$$p(n,j|m,i) \triangleq P(X\_l=n, C\_l=j \,|\, X\_{l-1}=m, C\_{l-1}=i) = I\_{(j-\max(0,n-k+i))}P(X\_l=n \,|\, X\_{l-1}=m).$$

The initial probabilities are

$$p(n,j|c\_0) \triangleq P(X\_1 = n, C\_1 = j \,|\, C\_0 = c\_0) = I\_{(j - \max(0, n - k + c\_0))} P(X\_1 = n).$$

The conditional probability that the run length of {(*X<sub>t</sub>*, *C<sub>t</sub>*)} equals *r* is defined by

$$p\_{m,i}(r) \triangleq P\big((X\_{r+1}, C\_{r+1}) \notin \mathcal{CR}, (X\_r, C\_r) \in \mathcal{CR}, \cdots, (X\_2, C\_2) \in \mathcal{CR} \,\big|\, (X\_1, C\_1) = (m, i)\big),$$

where (*m*, *i*) ∈ CR. Let the vector *u*<sub>(*k*)</sub> denote the *k*-th factorial moments, defined by $(u\_{(k)})\_{m,i} \triangleq \sum\_{r=1}^{\infty} r^{(k)} p\_{m,i}(r)$, where $k \geq 1$ and $r^{(k)} \triangleq r(r-1)\cdots(r-k+1)$. Then

$$p\_{m,i}(r) = \sum\_{(n,j)\in\mathcal{CR}} p\_{n,j}(r-1) \times p(n,j|m,i),$$

$$(u\_{(1)})\_{m,i} = 1 + \sum\_{(n,j)\in\mathcal{CR}} p(n,j|m,i) \times (u\_{(1)})\_{n,j}, \text{ i.e., } (\mathbf{I}-\mathbf{Q})u\_{(1)} = \mathbf{1}.$$

The ARL is obtained as

$$\text{ARL} = \sum\_{(m,i)\in \mathcal{CR}} (u\_{(1)})\_{m,i} \times p(m, i|c\_0).$$

For simplicity we do not repeat the proofs; see Weiß [22], Weiß and Testik [23] and Weiß and Testik [24] for more details. An efficient chart is expected to possess a large in-control ARL (denoted ARL0) and a small out-of-control ARL. Along with the ARL, we also assess the performance of the charts through the standard deviation of the run length (SDRL), as suggested by Weiß [22]. The SDRL of the ZIGINARRC(1) CUSUM chart can again be computed efficiently by the Markov chain method: the second-order factorial moments *u*<sub>(2)</sub> can be determined recursively from the relation (**I** − **Q**)*u*<sub>(2)</sub> = 2**Q***u*<sub>(1)</sub>. Then the SDRL is

$$\text{SDRL} = \sqrt{\sum\_{(m,i)\in\mathcal{CR}} \left( (u\_{(2)})\_{m,i} + (u\_{(1)})\_{m,i} \right) \times p(m,i|c\_0) - \text{ARL}^2}\,.$$

To implement the proposed monitoring scheme, the chart design pair (*h*, *k*) needs to be chosen in advance. Generally, a fixed ARL0 value is set as the target, and (*h*, *k*) is chosen accordingly. Guidelines for these choices are given in the next section.
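The Brook and Evans computation above can be sketched as follows (Python, numpy assumed). The helper functions implement the innovation pmf, the transition probabilities and the ZIG marginal from Section 2; the parameter values and the design (*h*, *k*) = (31, 2) are illustrative, and the resulting ARL0 should be close to the corresponding value reported in Section 4. This is our sketch of the approach, not the authors' code:

```python
import math
import numpy as np

def pmf_eps(j, theta, p, alpha, beta):
    """Innovation pmf of the ZIGINARRC(1) process (three-part mixture)."""
    s = beta + p * (1 - beta)
    th2 = alpha * theta * s
    w2 = (1 - p) * (1 - alpha) / (1 - alpha * s)
    w3 = (1 - p) * (1 - beta) * (alpha * s - p) / ((1 - alpha * s) * s)
    return ((p / s if j == 0 else 0.0)
            + w2 * theta**j / (1 + theta)**(j + 1)
            + w3 * th2**j / (1 + th2)**(j + 1))

def marginal(n, theta, p):
    """ZIG(p, theta) marginal pmf."""
    return (p if n == 0 else 0.0) + (1 - p) * theta**n / (1 + theta)**(n + 1)

def cusum_arl_sdrl(theta, p, alpha, beta, h, k, c0=0):
    """Exact ARL and SDRL of the ZIGINARRC(1) CUSUM chart (Brook-Evans)."""
    nmax = h + k
    pe = [pmf_eps(j, theta, p, alpha, beta) for j in range(nmax + 1)]
    # one-step transition probabilities P(X_t = n | X_{t-1} = m)
    P = np.zeros((nmax + 1, nmax + 1))
    for m in range(nmax + 1):
        for n in range(nmax + 1):
            v = beta * pe[n]
            for l in range(min(m, n) + 1):
                v += ((1 - beta) * math.comb(m, l) * alpha**l
                      * (1 - alpha)**(m - l) * pe[n - l])
            P[m, n] = v
    # reachable control region CR and transition matrix Q of (X_t, C_t)
    states = [(n, i) for i in range(h + 1)
              for n in range(max(0, i + k - h), i + k + 1)]
    idx = {st: a for a, st in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for (m, i) in states:
        for (n, j) in states:
            if j == max(0, n - k + i):
                Q[idx[(m, i)], idx[(n, j)]] = P[m, n]
    I = np.eye(len(states))
    u1 = np.linalg.solve(I - Q, np.ones(len(states)))   # (I - Q) u1 = 1
    u2 = np.linalg.solve(I - Q, 2 * Q @ u1)             # (I - Q) u2 = 2 Q u1
    p0 = np.array([marginal(n, theta, p) if j == max(0, n - k + c0) else 0.0
                   for (n, j) in states])
    arl = float(u1 @ p0)
    sdrl = math.sqrt(float((u2 + u1) @ p0) - arl**2)
    return arl, sdrl

arl0, sdrl0 = cusum_arl_sdrl(theta=2.0, p=0.2, alpha=0.5, beta=0.5, h=31, k=2)
print(round(arl0, 2), round(sdrl0, 2))
```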

#### **4. Computation Results**

In this section, we evaluate the performance of the ZIGINARRC(1) CUSUM chart based on extensive numerical experiments, presuming that the model parameters are known. In practice, the in-control parameters need to be estimated from the data, as shown in the next section. We search for possible chart designs (integer (*h*, *k*) pairs) that bring ARL0 close to the target value. Here the target ARL0 value is set to 370, which is commonly used in the statistical process monitoring domain. The values of ARL and SDRL are calculated exactly by the Markov chain method, and we report the results with two decimal places for simplicity. We first compute ARL0 and SDRL0 of the CUSUM chart for different in-control process parameters and initial values in Table 1. The process parameters are *θ*<sub>0</sub> = {1, 5}; *p*<sub>0</sub> = {0.1, 0.3}; *α*<sub>0</sub> = {0.5, 0.8}; *β*<sub>0</sub> = {0.5, 0.8}, and the initial values are *c*<sub>0</sub> = {0, 3, 6}. These choices cover a broad range of scenarios. Based on the results in Table 1, three important conclusions can be drawn. First, when *c*<sub>0</sub> takes a small value, the deviation between ARL0 and SDRL0 is small; when *c*<sub>0</sub> takes a larger value, SDRL0 can be significantly greater than ARL0 (for example, ARL0 = 409.42 and SDRL0 = 442.13 under (*θ*<sub>0</sub>, *p*<sub>0</sub>, *α*<sub>0</sub>, *β*<sub>0</sub>, *c*<sub>0</sub>) = (1, 0.3, 0.5, 0.8, 6)). Thus, we set *c*<sub>0</sub> = 0 in the following studies for better robustness. Second, as the differences between ARL and SDRL are small when *c*<sub>0</sub> = 0, we only use the ARL as the measure in the following computations to save space. Last, the parameter *θ*<sub>0</sub> shows a great influence on the selection of the control design (*h*, *k*): a larger *θ*<sub>0</sub> requires a larger pair (*h*, *k*).

**Table 1.** ARL0 and SDRL0 of the CUSUM chart for various *θ*0, *p*0, *α*0, *β*<sup>0</sup> and *c*0.


Due to its simplicity, the conventional Shewhart chart is very popular for monitoring process shifts. The upper limit of the Shewhart chart is denoted *UCL*; a fault is declared whenever an observation of the process {*X<sub>t</sub>*} exceeds this threshold (*X<sub>t</sub>* > *UCL*). Figures 2 and 3 investigate the CUSUM method preliminarily by comparing it with the Shewhart method. In both figures, we assume that the in-control parameters are *θ*<sub>0</sub> = 2, *p*<sub>0</sub> = 0.2, *α*<sub>0</sub> = 0.5 and *β*<sub>0</sub> = 0.5, which are selected based on the real drug crime data in Section 5. According to these parameters, the CUSUM chart designs can be determined, respectively, as *h* = 31, *k* = 2 (corresponding ARL0 = 383.74); *h* = 19, *k* = 3 (ARL0 = 396.12); *h* = 14, *k* = 4 (ARL0 = 373.27); *h* = 11, *k* = 5 (ARL0 = 370.77); and *h* = 9, *k* = 6 (ARL0 = 394.03). Furthermore, the Shewhart chart limit *UCL* = 13 (ARL0 = 381.31) can be used. Two types of changes are considered in Figure 2, both of which lead to upward mean shifts. The first type occurs only in the parameter *θ*, with the other parameters invariant; the results are shown in Figure 2a. Similarly, the second type occurs only in the parameter *p*; the results are in Figure 2b. From Figure 2, we can conclude that the CUSUM chart with the design *h* = 31, *k* = 2 outperforms the other CUSUM charts under most shifts, while the Shewhart chart performs worst among them. For the upward correlation shift scenarios, ARL values under two types of parameter changes are displayed in Figure 3: the first considers changes only in *α*, and the second considers changes only in *β*. In each scenario, the ARL of the Shewhart chart even increases with the first-order correlation *ρ*<sub>*X*</sub>, while the CUSUM chart behaves better in the figure. In a comprehensive view, the conventional Shewhart chart is insensitive to upward mean shifts caused by changes in the parameter *p*, and fails to detect shifts in the correlation, whereas the proposed CUSUM chart overcomes these limitations and is superior under various coefficient shifts. From the figures, we can also conclude that the smaller the value of *k*, the more sensitive the CUSUM chart is. As the constraint *k* ≥ *μ*<sub>0</sub> is required for the chart to be reasonable, it is natural to recommend *k* = ⌈*μ*<sub>0</sub>⌉ (the smallest integer no less than *μ*<sub>0</sub>); we then aim to select the value of *h* such that ARL0 is close to 370. Next, the computations for the CUSUM chart are extended to general cases with designs of experiments as follows.

**Figure 2.** The performance of the Shewhart chart and the CUSUM chart to detect an increase in process mean, (**a**) changes only in *θ*, (**b**) changes only in *p*.

**Figure 3.** The performance of the Shewhart chart and the CUSUM chart to detect an increase in process first-order correlation, (**a**) changes only in *α*, (**b**) changes only in *β*.
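For the Shewhart comparison above, the ARL under serial dependence can be obtained from the same Markov chain idea, with *X<sub>t</sub>* itself as the chain and the non-signalling states {0, …, *UCL*}. Below is a sketch (Python, numpy assumed; parameters as in the figures; the treatment of an immediate signal at *t* = 1 is our convention, so small deviations from the reported values are possible):

```python
import math
import numpy as np

def pmf_eps(j, theta, p, alpha, beta):
    """Innovation pmf of the ZIGINARRC(1) process."""
    s = beta + p * (1 - beta)
    th2 = alpha * theta * s
    w2 = (1 - p) * (1 - alpha) / (1 - alpha * s)
    w3 = (1 - p) * (1 - beta) * (alpha * s - p) / ((1 - alpha * s) * s)
    return ((p / s if j == 0 else 0.0)
            + w2 * theta**j / (1 + theta)**(j + 1)
            + w3 * th2**j / (1 + th2)**(j + 1))

def shewhart_arl(theta, p, alpha, beta, ucl):
    """ARL of the Shewhart chart (signal when X_t > UCL) for the
    ZIGINARRC(1) process, via the chain restricted to {0, ..., UCL}."""
    pe = [pmf_eps(j, theta, p, alpha, beta) for j in range(ucl + 1)]
    P = np.zeros((ucl + 1, ucl + 1))   # P[m, n] = P(X_t = n | X_{t-1} = m)
    for m in range(ucl + 1):
        for n in range(ucl + 1):
            v = beta * pe[n]
            for l in range(min(m, n) + 1):
                v += ((1 - beta) * math.comb(m, l) * alpha**l
                      * (1 - alpha)**(m - l) * pe[n - l])
            P[m, n] = v
    u = np.linalg.solve(np.eye(ucl + 1) - P, np.ones(ucl + 1))
    # start from the stationary ZIG marginal; X_1 > UCL signals at once
    marg = [(p if n == 0 else 0.0)
            + (1 - p) * theta**n / (1 + theta)**(n + 1)
            for n in range(ucl + 1)]
    return float(np.dot(u, marg) + (1.0 - sum(marg)))

arl_shew = shewhart_arl(theta=2.0, p=0.2, alpha=0.5, beta=0.5, ucl=13)
print(round(arl_shew, 2))
```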

In Tables 2–4, we focus on situations where there are increasing shifts in the process mean while the correlation remains the same. Each in-control parameter has three levels: *θ*<sub>0</sub> = {1, 3, 5}; *p*<sub>0</sub> = {0.1, 0.2, 0.3}; *α*<sub>0</sub> = {0.5, 0.6, 0.7}; *β*<sub>0</sub> = {0.5, 0.6, 0.7}. We consider the case when the changes occur only in *θ*, which is the most common case. The out-of-control process mean is *μ*<sub>1</sub> = *μ*<sub>0</sub> + *δσ*<sub>0</sub>, where the shift size *δ* takes values in the set {0.5, 1, 1.5, 6}. The usual relative deviation (in %) in ARL is defined as dev(%) = 100% × (ARL − ARL0)/ARL0 (Weiß and Testik [24]). From Tables 2–4, we can conclude that the ZIGINARRC(1) CUSUM chart performs quite well in detecting upward mean shifts in all scenarios. For the small shift size *δ* = 0.5, the CUSUM chart is efficient, with drops in ARL between 86.38% and 91.23%. For a larger shift (*δ* = 1), the drop in ARL is at least 92.81% and at most 96.03%. It can also be observed that *δ* has to be at least 6 to obtain an immediate signal, with the out-of-control ARL close to 1. In addition, extensive computation results show that the in-control parameters (*θ*<sub>0</sub>, *p*<sub>0</sub>, *α*<sub>0</sub>, *β*<sub>0</sub>) have little effect on the good performance of the CUSUM chart in detecting mean shifts.


**Table 2.** ARL profiles of the CUSUM chart versus mean shifts under *θ*<sup>0</sup> = 1.


**Table 3.** ARL profiles of the CUSUM chart versus mean shifts under *θ*<sup>0</sup> = 3.


**Table 4.** ARL profiles of the CUSUM chart versus mean shifts under *θ*<sup>0</sup> = 5.

The computation study in Tables 5 and 6 concerns upward shifts in the process correlation. Two levels are considered for each in-control parameter: *θ*<sub>0</sub> = {1, 5}; *p*<sub>0</sub> = {0.1, 0.2}; *α*<sub>0</sub> = {0.5, 0.6}; *β*<sub>0</sub> = {0.7, 0.8}. Two types of out-of-control pattern are considered for a comprehensive investigation: in the first type, upward changes exist only in *α* (Table 5), and in the second type, downward changes exist only in *β* (Table 6). In Table 5, the shift magnitudes *δ<sub>α</sub>* = *α*<sub>1</sub> − *α*<sub>0</sub> are taken from the set {0.1, 0.2, 0.3}. The results imply that the performance of the CUSUM chart fluctuates greatly when detecting correlation shifts caused only by *α*: when *δ<sub>α</sub>* is 0.3, dev(%) ranges from −6.81% to −23.68%. Another finding from the design of experiments in Table 5 is that both a smaller *θ*<sub>0</sub> and a smaller *β*<sub>0</sub> slightly improve the detection efficiency, while *p*<sub>0</sub> and *α*<sub>0</sub> do not. In Table 6, the shift magnitudes *δ<sub>β</sub>* = *β*<sub>1</sub> − *β*<sub>0</sub> are taken from the set {−0.1, −0.2, −0.3}. From Table 6, we can see that the CUSUM chart is more efficient in detecting correlation shifts caused by *β*: as the absolute value of *δ<sub>β</sub>* gets larger, the proportional decrease in ARL gradually increases. When *δ<sub>β</sub>* = −0.3, dev(%) ranges from −21.3% to −40.39%. Meanwhile, a smaller *θ*<sub>0</sub> and a larger *α*<sub>0</sub> often lead to better chart performance, while *p*<sub>0</sub> and *β*<sub>0</sub> have little influence. Based on all the analysis above, we can further conclude that a smaller *θ*<sub>0</sub> and a larger value of the initial correlation *ρ*<sub>0</sub> are helpful in detecting correlation shifts. Furthermore, an immediate signal cannot be obtained when only correlation shifts occur.

**Table 5.** ARL profiles of the CUSUM chart versus correlation shifts caused by *α*.



**Table 6.** ARL profiles of the CUSUM chart versus correlation shifts caused by *β*.

#### **5. Analyses of Drug Crime Count Time Series**

In this section, we present a case study of crime count data from Pittsburgh. The data set contains multiple crime types, such as arson, drink-driving, robbery and so on. Monitoring crime data is needed not only for early warnings of organised crime, but also for assessments of the social security environment. The crime data can be downloaded from the Forecasting Principles site (http://www.forecastingprinciples.com, accessed on 20 March 2021) or obtained by email from the corresponding author. The subset we analyse is a monthly drug crime count series collected from the 56th police car beat, which contains 144 observations from January 1990 to December 2001. There are 67 zeros in this series (a proportion of 46.53%), zero being the most frequent value. The sample mean, variance and first-order autocorrelation of the data are 1.7153, 6.4289 and 0.3886, respectively, which show strong overdispersion and autocorrelation. The sample path and the histogram of the series are in Figure 4. The histograms of the fitted ZIG, Geometric and Poisson distributions are also given in Figure 4b, which indicate that the ZIG marginal is the most appropriate description of the data. The sample autocorrelation function (ACF) and the sample partial autocorrelation function (PACF) in Figure 5 suggest that the series most likely comes from an AR-type process of order 3. Since our intention is to illustrate the implementation of the proposed control chart, we employ the first-order INAR models that are widely studied and applied in the literature; the consideration of more complex models is left for future study.
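As a quick plug-in check on these sample statistics, the moment formulas of Section 2 can be inverted in closed form: since *μ<sub>X</sub>* = (1 − *p*)*θ* and *P*(*X* = 0) = *p* + (1 − *p*)/(1 + *θ*) = 1 − *μ<sub>X</sub>*/(1 + *θ*), we get *θ* = *μ̂*/(1 − *ẑ*) − 1 from the sample mean *μ̂* and zero proportion *ẑ*. A sketch in Python (note the paper itself uses conditional least squares or maximum likelihood, and *ρ̂* = *α*(1 − *β*) identifies only the product, not *α* and *β* separately):

```python
# sample statistics quoted above for the drug crime series
m_hat = 1.7153      # sample mean
z_hat = 67 / 144    # zero proportion (67 zeros among 144 observations)
rho_hat = 0.3886    # sample lag-one autocorrelation, estimates alpha*(1-beta)

# invert mu_X = (1 - p) * theta and P(X = 0) = 1 - mu_X / (1 + theta)
theta_hat = m_hat / (1 - z_hat) - 1
p_hat = 1 - m_hat / theta_hat
print(round(theta_hat, 3), round(p_hat, 3))  # → 2.208 0.223
```

These moment estimates are close to the in-control values *θ*<sub>0</sub> = 2 and *p*<sub>0</sub> = 0.2 used in the illustrations of Section 4.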

**Figure 4.** The plots about the zero-inflated drug crime series, (**a**) the sample path, (**b**) the histogram with ZIG fit, Geometric fit and Poisson fit.

**Figure 5.** The plots about the zero-inflated drug crime series, (**a**) the ACF plot, (**b**) the PACF plot.

Besides the ZIGINARRC(1) model, some competing models are also fitted to the time series: Poisson INAR(1) (Al-Osh and Alzaid [1]), GINAR(1) (Alzaid and Al-Osh [7]), ZINAR(1) (Jazi, Jones and Lai [9]), ZMGINAR(1) (Barreto-Souza [11]), NGINAR(1) (Ristić, Bakouch and Nastić [12]) and ZIMINAR(1) (Li, Wang and Zhang [14]). The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are used to evaluate these models. The numerical results in Table 7 show that the ZIGINARRC(1) process has the best overall performance compared with its competitors. Therefore, we assume that the drug crime data come from the ZIGINARRC(1) model, and the estimated parameters in Table 7 are used in the process control procedures. Based on the computation results in Section 4, the CUSUM chart with design *h* = 34, *k* = 2 (corresponding ARL0 = 364.44) is the best choice, shown in Figure 6a. For comparison, we also present CUSUM control charts with designs *h* = 15, *k* = 4 (ARL0 = 358.40) in Figure 6b and *h* = 12, *k* = 5 (ARL0 = 372.28) in Figure 6c, and the Shewhart chart with control limit *UCL* = 13 (ARL0 = 340.25) in Figure 6d. We observe that all the CUSUM charts give out-of-control signals, while there are no out-of-control points in the Shewhart chart. Because the Shewhart chart has been shown to be less effective than the CUSUM chart, the drug crime data set appears to be out-of-control with increasing mean shifts or increasing correlation shifts, and some investigation should be carried out for further explanation. The CUSUM control charts with the three designs also display different detection efficiencies. The CUSUM chart with *k* = 2 first signals at *t* = 131, followed by continuous alarms as *t* increases. The signals of the CUSUM chart with *k* = 4 are first given at *t* = 133, then fall back below the control limit for a period of time, and appear again at *t* = 141, while the first signal of the CUSUM chart with *k* = 5 occurs at *t* = 144. This analysis confirms again that the CUSUM chart design *k* = ⌈*μ*<sub>0</sub>⌉ is the most effective in practice.


**Table 7.** The estimated parameters, AIC and BIC of candidate models.

**Figure 6.** The control charts for the zero-inflated drug crime series, the CUSUM charts with designs (**a**) *h* = 34, *k* = 2, (**b**) *h* = 15, *k* = 4, (**c**) *h* = 12, *k* = 5, and (**d**) the Shewhart chart with *UCL* = 13.

#### **6. Conclusions**

In this paper, we have made contributions to monitoring zero-inflated autocorrelated count data that can be described by an INAR(1) process with random coefficient. The ARL and SDRL are adopted as performance measures and calculated by the Markov chain approach. The design parameter *k* of the CUSUM chart is shown to have a great influence on the monitoring efficiency, and the smallest integer no less than *μ*<sub>0</sub> is the recommended value of *k*. The proposed ZIGINARRC(1) CUSUM chart is shown to be superior to the conventional Shewhart chart under various coefficient shifts. The computation results also show that the CUSUM chart performs quite well in detecting upward mean shifts, while its performance fluctuates in detecting upward correlation shifts. Based on the designs of experiments, we also find that a larger value of the initial correlation *ρ*<sub>0</sub> is helpful in detecting correlation shifts. An immediate signal occurs after large upward mean shifts, but not when only correlation shifts exist.

There are some possible topics for our future research. First, we can consider a different monitoring scheme for the ZIGINARRC(1) process and conduct a comparison study. Second, we can explore the monitoring of the *p*-th order random coefficient INAR model, which is suitable for count data with higher-order dependence. Third, we can study a multivariate INAR(1) process with random coefficient to continuously monitor the serially correlated counts of interest.

**Author Contributions:** Conceptualization, C.L. and D.W.; methodology, C.L.; software, S.C.; validation, C.L., S.C. and D.W.; formal analysis, D.W.; investigation, C.L.; resources, C.L. and D.W.; data curation, S.C.; writing—original draft preparation, S.C.; writing—review and editing, C.L.; visualization, S.C.; supervision, D.W.; project administration, C.L.; funding acquisition, C.L. and D.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by National Natural Science Foundation of China grant numbers 12001229, 11871028 and 11731015, and Natural Science Foundation of Jilin Province (No. 20180101216JC).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Robust Estimation for Bivariate Poisson INGARCH Models**

**Byungsoo Kim 1,\*, Sangyeol Lee <sup>2</sup> and Dongwon Kim <sup>2</sup>**


**Abstract:** In integer-valued generalized autoregressive conditional heteroscedastic (INGARCH) models, parameter estimation is conventionally based on the conditional maximum likelihood estimator (CMLE). However, because the CMLE is sensitive to outliers, we consider a robust estimation method for bivariate Poisson INGARCH models using the minimum density power divergence estimator. We demonstrate that the proposed estimator is consistent and asymptotically normal under certain regularity conditions. Monte Carlo simulations are conducted to evaluate the performance of the estimator in the presence of outliers. Finally, a real data analysis using monthly count series of crimes in New South Wales and an artificial data example are provided as illustrations.

**Keywords:** integer-valued time series; bivariate Poisson INGARCH model; outliers; robust estimation; minimum density power divergence estimator

#### **1. Introduction**

Integer-valued time series models have received widespread attention from researchers and practitioners due to their versatile applications in many scientific areas, including finance, insurance, marketing, and quality control. Numerous studies focus on integer-valued autoregressive (INAR) models to analyze time series of counts; see Weiß [1] and Scotto et al. [2] for general reviews. Taking a different approach, Ferland et al. [3] proposed Poisson integer-valued generalized autoregressive conditional heteroscedastic (INGARCH) models, and Fokianos et al. [4] developed Poisson AR models that generalize the linear assumption on INGARCH models. The Poisson assumption on INGARCH models has been extended to negative binomial INGARCH models (Davis and Wu [5] and Christou and Fokianos [6]), zero-inflated generalized Poisson INGARCH models (Zhu [7,8] and Lee et al. [9]), and one-parameter exponential family AR models (Davis and Liu [10]). We refer to the review papers by Fokianos [11,12] and Tjøstheim [13,14] for more details.

Researchers have invested considerable effort in extending univariate integer-valued time series models to bivariate (multivariate) models. For INAR-type models, Quoreshi [15] proposed bivariate integer-valued moving average models and Pedeli and Karlis [16] introduced bivariate INAR models with Poisson and negative binomial innovations. Liu [17] proposed bivariate Poisson INGARCH models with a bivariate Poisson distribution constructed via the trivariate reduction method and established the stationarity and ergodicity of the model. Andreassen [18] later verified the consistency of the conditional maximum likelihood estimator (CMLE), and Lee et al. [19] studied the asymptotic normality of the CMLE and developed CMLE- and residual-based change point tests. However, this model has the drawback that it can only accommodate positive correlation between two time series of counts. To cope with this issue, Cui and Zhu [20] recently introduced a new bivariate Poisson INGARCH model based on Lakshminarayana et al.'s [21] bivariate Poisson distribution. Their model can deal with positive or negative correlation, depending on the multiplicative factor parameter. They employed the CMLE for parameter estimation. However, because the CMLE is unduly influenced by outliers, robust estimation in bivariate Poisson INGARCH models is crucial and deserves thorough investigation.

**Citation:** Kim, B.; Lee, S.; Kim, D. Robust Estimation for Bivariate Poisson INGARCH Models. *Entropy* **2021**, *23*, 367. https://doi.org/ 10.3390/e23030367

Academic Editor: Christian H. Weiss

Received: 15 February 2021 Accepted: 16 March 2021 Published: 19 March 2021



As such, here we develop a robust estimator for Cui and Zhu's [20] bivariate Poisson INGARCH models. Among the robust estimation methods, we employ the minimum density power divergence estimator (MDPDE) approach that was originally proposed by Basu et al. [22], because it is well known to consistently provide robust estimators in various situations. For previous works in the context of time series of counts, see Kang and Lee [23], Kim and Lee [24,25], Diop and Kengne [26], Kim and Lee [27], and Lee and Kim [28], who studied the MDPDE for Poisson AR models, zero-inflated Poisson AR models, one-parameter exponential family AR models, and change point tests. For another robust estimation approach in INGARCH models, see Xiong and Zhu [29] and Li et al. [30], who studied Mallows' quasi-likelihood method. To the best of our knowledge, the robust estimation method for bivariate Poisson INGARCH models has not been previously studied. In earlier studies, the MDPDE was proven to possess strong robust properties against outliers with little loss in asymptotic efficiency relative to the CMLE. This study confirms the same conclusion for bivariate Poisson INGARCH models.

The rest of this paper is organized, as follows. Section 2 constructs the MDPDE for bivariate Poisson INGARCH models. Section 3 shows the asymptotic properties of the MDPDE. Section 4 conducts empirical studies to evaluate the performance of the MDPDE. Section 5 provides concluding remarks. Appendix A provides the proof.

#### **2. MDPDE for Bivariate Poisson INGARCH Models**

Basu et al. [22] defined the density power divergence *dα* between two densities *f* and *g*, with a tuning parameter *α*, as

$$d_{\alpha}(g,f) = \begin{cases} \int \left\{ f^{1+\alpha}(y) - \left(1+\frac{1}{\alpha}\right) g(y) f^{\alpha}(y) + \frac{1}{\alpha} g^{1+\alpha}(y) \right\} dy, & \alpha > 0, \\ \int g(y)\left(\log g(y) - \log f(y)\right) dy, & \alpha = 0. \end{cases}$$

For a parametric family $\{F_{\theta};\, \theta \in \Theta\}$ having densities $\{f_{\theta}\}$ and a distribution $G$ with density $g$, they defined the minimum density power divergence functional $T_{\alpha}(G)$ by $d_{\alpha}(g, f_{T_{\alpha}(G)}) = \min_{\theta \in \Theta} d_{\alpha}(g, f_{\theta})$. If $G$ belongs to $\{F_{\theta}\}$, that is, $G = F_{\theta^0}$ for some $\theta^0 \in \Theta$, then $T_{\alpha}(F_{\theta^0}) = \theta^0$. Let $g$ be the density function of a random sample $Y_1, \ldots, Y_n$. Using the empirical distribution $G_n$ to approximate $G$, Basu et al. [22] defined the MDPDE by

$$\hat{\theta}_{\alpha,n} = \underset{\theta \in \Theta}{\operatorname{argmin}}\, H_{\alpha,n}(\theta),$$

where $H_{\alpha,n}(\theta) = \frac{1}{n}\sum_{t=1}^{n} h_{\alpha,t}(\theta)$ and

$$h_{\alpha,t}(\theta) = \begin{cases} \int f_{\theta}^{1+\alpha}(y)\, dy - \left(1+\frac{1}{\alpha}\right) f_{\theta}^{\alpha}(Y_t), & \alpha > 0, \\ -\log f_{\theta}(Y_t), & \alpha = 0. \end{cases}$$

The tuning parameter *α* controls the trade-off between the robustness and asymptotic efficiency of the MDPDE. Namely, relatively large *α* values improve the robustness but the estimator's efficiency decreases. The MDPDE with *α* = 0 and 1 leads to the MLE and *L*2-distance estimator, respectively. Basu et al. [22] showed the consistency and asymptotic normality of the MDPDE and demonstrated that the estimator is robust against outliers, but it still retains high efficiency when the true distribution belongs to a parametric family {*Fθ*} and *α* is close to zero.
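The trade-off above can be made concrete with a sketch that computes the MDPDE for a simple univariate Poisson family on a crude parameter grid; the sample, the grid, and the truncation of $\int f_{\theta}^{1+\alpha}$ (a sum over the count support) are illustrative choices, not part of the original study. The CMLE ($\alpha = 0$) is dragged toward a gross outlier, while a moderate $\alpha$ down-weights it.

```python
import math

def pois_pmf(y, lam):
    # Poisson pmf computed in log space to stay stable for large y
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def h_alpha(y_t, lam, alpha, y_max=200):
    """Per-observation MDPDE objective h_{alpha,t} for a Poisson(lam) family.

    For counts, the integral of f^(1+alpha) is a sum over the support,
    truncated here at y_max; alpha = 0 recovers the negative log-likelihood.
    """
    if alpha == 0.0:
        return -math.log(pois_pmf(y_t, lam))
    s = sum(pois_pmf(y, lam) ** (1.0 + alpha) for y in range(y_max + 1))
    return s - (1.0 + 1.0 / alpha) * pois_pmf(y_t, lam) ** alpha

def mdpde(sample, alpha, grid):
    """Crude argmin of the average h_{alpha,t} over a parameter grid."""
    return min(grid, key=lambda lam: sum(h_alpha(y, lam, alpha) for y in sample) / len(sample))

# A clean sample around lam = 3 plus one gross outlier at 40.
sample = [2, 3, 4, 3, 2, 5, 3, 4, 2, 3, 40]
grid = [x / 10 for x in range(10, 101)]   # lam in [1.0, 10.0]
lam_mle = mdpde(sample, 0.0, grid)        # alpha = 0: CMLE, pulled up by the outlier
lam_robust = mdpde(sample, 0.5, grid)     # alpha = 0.5: outlier effectively ignored
```

In this toy setting the grid minimizer at $\alpha = 0$ sits near the contaminated sample mean, while at $\alpha = 0.5$ it stays near the bulk of the data.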

We need to define the conditional version of the MDPDE in order to apply the above procedure to bivariate Poisson INGARCH models. Let $\{f_{\theta}(\cdot|\mathcal{F}_{t-1})\}$ denote the parametric family of autoregressive models indexed by the parameter $\theta$, and let $f_{\theta^0}(\cdot|\mathcal{F}_{t-1})$ be the true conditional density of the time series $Y_t$ given $\mathcal{F}_{t-1}$, where $\mathcal{F}_{t-1}$ is the $\sigma$-field generated by $Y_{t-1}, Y_{t-2}, \ldots$. The MDPDE of $\theta^0$ is then given by

$$\hat{\theta}_{\alpha,n} = \underset{\theta \in \Theta}{\operatorname{argmin}}\, H_{\alpha,n}(\theta),$$

where $H_{\alpha,n}(\theta) = \frac{1}{n}\sum_{t=1}^{n} h_{\alpha,t}(\theta)$ and

$$h_{\alpha,t}(\theta) = \begin{cases} \int f_{\theta}^{1+\alpha}(y|\mathcal{F}_{t-1})\, dy - \left(1+\frac{1}{\alpha}\right) f_{\theta}^{\alpha}(Y_t|\mathcal{F}_{t-1}), & \alpha > 0, \\ -\log f_{\theta}(Y_t|\mathcal{F}_{t-1}), & \alpha = 0 \end{cases} \tag{1}$$

(cf. Section 2 of Kang and Lee [23]).

Let $Y_t = (Y_{t,1}, Y_{t,2})^T$ be a two-dimensional vector of counts at time $t$; namely, $\{Y_{t,1}, t \ge 1\}$ and $\{Y_{t,2}, t \ge 1\}$ are the two time series of counts under consideration. Liu [17] proposed the bivariate Poisson INGARCH model, as follows:

$$Y_t|\mathcal{F}_{t-1} \sim BP^*(\lambda_{t,1}, \lambda_{t,2}, \phi), \quad \lambda_t = (\lambda_{t,1}, \lambda_{t,2})^T = \omega + A\lambda_{t-1} + BY_{t-1},$$

where $\mathcal{F}_t$ is the $\sigma$-field generated by $Y_t, Y_{t-1}, \ldots$, $\phi \ge 0$, $\omega = (\omega_1, \omega_2)^T \in \mathbb{R}^2_+$, and $A = \{a_{ij}\}_{i,j=1,2}$ and $B = \{b_{ij}\}_{i,j=1,2}$ are $2 \times 2$ matrices with non-negative entries. $BP^*(\lambda_{t,1}, \lambda_{t,2}, \phi)$ denotes the bivariate Poisson distribution constructed via the trivariate reduction method, whose probability mass function (PMF) is

$$\begin{split} &P(Y_{t,1} = y_1, Y_{t,2} = y_2 | \mathcal{F}_{t-1}) \\ &= e^{-(\lambda_{t,1} + \lambda_{t,2} - \phi)} \frac{(\lambda_{t,1} - \phi)^{y_1}}{y_1!} \frac{(\lambda_{t,2} - \phi)^{y_2}}{y_2!} \sum_{s=0}^{\min(y_1, y_2)} \binom{y_1}{s} \binom{y_2}{s} s! \left\{ \frac{\phi}{(\lambda_{t,1} - \phi)(\lambda_{t,2} - \phi)} \right\}^s. \end{split}$$

In this model, $Cov(Y_{t,1}, Y_{t,2}|\mathcal{F}_{t-1}) = \phi \in [0, \min(\lambda_{t,1}, \lambda_{t,2}))$, so the model has the drawback that it can only accommodate positive correlation between the two components.
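The trivariate reduction construction behind $BP^*$ sets $Y_1 = U + W$ and $Y_2 = V + W$ with independent $U \sim \text{Poisson}(\lambda_1 - \phi)$, $V \sim \text{Poisson}(\lambda_2 - \phi)$, and a common shock $W \sim \text{Poisson}(\phi)$, so the conditional covariance equals $\phi \ge 0$. A minimal sampler sketch, with illustrative parameter values:

```python
import math
import random

def rpois(lam, rng):
    # Poisson draw by inversion of the CDF (adequate for moderate lam)
    u, p, k = rng.random(), math.exp(-lam), 0
    c = p
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

def rbp_star(lam1, lam2, phi, rng):
    # Trivariate reduction: Y1 = U + W and Y2 = V + W share the common
    # shock W, hence Cov(Y1, Y2) = phi >= 0 -- only positive correlation.
    u = rpois(lam1 - phi, rng)
    v = rpois(lam2 - phi, rng)
    w = rpois(phi, rng)
    return u + w, v + w

rng = random.Random(1)
draws = [rbp_star(4.0, 3.0, 1.5, rng) for _ in range(20000)]
m1 = sum(d[0] for d in draws) / len(draws)
m2 = sum(d[1] for d in draws) / len(draws)
cov = sum(d[0] * d[1] for d in draws) / len(draws) - m1 * m2
```

The sample means stay near $\lambda_1$ and $\lambda_2$ and the sample covariance near $\phi$, which is exactly the restriction to non-negative dependence noted above.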

To overcome this defect, Cui and Zhu [20] proposed a new bivariate Poisson INGARCH model using the distribution proposed by Lakshminarayana et al. [21]. They considered the model:

$$Y_t|\mathcal{F}_{t-1} \sim BP(\lambda_{t,1}, \lambda_{t,2}, \delta), \quad \lambda_t = (\lambda_{t,1}, \lambda_{t,2})^T = \omega + A\lambda_{t-1} + BY_{t-1}, \tag{2}$$

and *BP*(*λt*,1, *λt*,2, *δ*) is the bivariate Poisson distribution constructed as a product of Poisson marginals with a multiplicative factor, whose PMF is given by

$$\begin{aligned} &P(Y_{t,1} = y_1, Y_{t,2} = y_2 | \mathcal{F}_{t-1}) \\ &= \frac{\lambda_{t,1}^{y_1} \lambda_{t,2}^{y_2}}{y_1!\, y_2!} e^{-(\lambda_{t,1} + \lambda_{t,2})} \left\{ 1 + \delta (e^{-y_1} - e^{-c\lambda_{t,1}})(e^{-y_2} - e^{-c\lambda_{t,2}}) \right\}, \tag{3} \end{aligned}$$

where $c = 1 - e^{-1}$. The marginal conditional distributions of $Y_{t,1}$ and $Y_{t,2}$ are Poisson with parameters $\lambda_{t,1}$ and $\lambda_{t,2}$, respectively, and $Cov(Y_{t,1}, Y_{t,2}|\mathcal{F}_{t-1}) = \delta c^2 \lambda_{t,1}\lambda_{t,2} e^{-c(\lambda_{t,1}+\lambda_{t,2})}$. Hence, this model supports positive or negative correlation, depending on the multiplicative factor parameter $\delta$. Cui and Zhu [20] established the stationarity and ergodicity of the model under certain conditions and showed the consistency and asymptotic normality of the CMLE.
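The properties of the PMF (3) are easy to verify numerically: it sums to one, keeps Poisson marginal means, and reproduces the covariance formula just stated. A sketch with illustrative parameter values and a truncated support:

```python
import math

C = 1.0 - math.exp(-1.0)  # the constant c in the PMF (3)

def bp_pmf(y1, y2, lam1, lam2, delta):
    """PMF (3): product of Poisson marginals times a multiplicative factor."""
    base = math.exp(y1 * math.log(lam1) + y2 * math.log(lam2) - lam1 - lam2
                    - math.lgamma(y1 + 1) - math.lgamma(y2 + 1))
    factor = 1.0 + delta * (math.exp(-y1) - math.exp(-C * lam1)) \
                         * (math.exp(-y2) - math.exp(-C * lam2))
    return base * factor

lam1, lam2, delta = 2.0, 3.0, -0.8   # negative delta: negative correlation
N = 60  # support truncation; tail mass beyond it is negligible here
total = sum(bp_pmf(i, j, lam1, lam2, delta) for i in range(N) for j in range(N))
m1 = sum(i * bp_pmf(i, j, lam1, lam2, delta) for i in range(N) for j in range(N))
m2 = sum(j * bp_pmf(i, j, lam1, lam2, delta) for i in range(N) for j in range(N))
cov = sum(i * j * bp_pmf(i, j, lam1, lam2, delta)
          for i in range(N) for j in range(N)) - m1 * m2
cov_formula = delta * C**2 * lam1 * lam2 * math.exp(-C * (lam1 + lam2))
```

With $\delta < 0$ the numerical covariance is negative and matches $\delta c^2 \lambda_1 \lambda_2 e^{-c(\lambda_1+\lambda_2)}$, illustrating the advantage of this construction over the trivariate reduction model.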

In this study, we apply the MDPDE to the model (2). We focus on the case where $A$ is a diagonal matrix, because this simplification reduces the number of model parameters and makes the model easy to use in practice, as Heinen and Rengifo [31] suggested. Further, the diagonal setup of $A$ eases the verification of the asymptotic properties of the MDPDE. Similar approaches can be found in Liu [17], Lee et al. [19], and Cui et al. [32]. Let $A = diag(a_1, a_2)$. We then set $\theta = (\theta_1^T, \theta_2^T, \delta)^T$, where $\theta_1 = (\omega_1, a_1, b_{11}, b_{12})^T$ and $\theta_2 = (\omega_2, a_2, b_{21}, b_{22})^T$, and write the true parameter as $\theta^0 = (\theta_1^{0T}, \theta_2^{0T}, \delta^0)^T$, where $\theta_1^0 = (\omega_1^0, a_1^0, b_{11}^0, b_{12}^0)^T$ and $\theta_2^0 = (\omega_2^0, a_2^0, b_{21}^0, b_{22}^0)^T$.

Given $Y_1, \ldots, Y_n$ generated from (2), we obtain the MDPDE of $\theta^0$ from (1) by

$$\hat{\theta}_{\alpha,n} = \underset{\theta \in \Theta}{\operatorname{argmin}}\, \tilde{H}_{\alpha,n}(\theta) = \underset{\theta \in \Theta}{\operatorname{argmin}}\, \frac{1}{n} \sum_{t=1}^{n} \tilde{h}_{\alpha,t}(\theta),$$

where

$$\tilde{h}_{\alpha,t}(\theta) = \begin{cases} \sum_{y_1=0}^{\infty} \sum_{y_2=0}^{\infty} f_{\theta}^{1+\alpha}(y|\tilde{\lambda}_t) - \left(1+\frac{1}{\alpha}\right) f_{\theta}^{\alpha}(Y_t|\tilde{\lambda}_t), & \alpha > 0, \\ -\log f_{\theta}(Y_t|\tilde{\lambda}_t), & \alpha = 0, \end{cases} \tag{4}$$

$f_{\theta}(y|\tilde{\lambda}_t)$ for $y = (y_1, y_2)^T$ is the conditional PMF in (3), and $\tilde{\lambda}_t$ is recursively defined by

$$\tilde{\lambda}_t = (\tilde{\lambda}_{t,1}, \tilde{\lambda}_{t,2})^T = \omega + A\tilde{\lambda}_{t-1} + BY_{t-1}, \quad t \ge 2,$$

with an arbitrarily chosen initial value $\tilde{\lambda}_1$. We also use the notations $\lambda_t(\theta)$ and $\tilde{\lambda}_t(\theta)$ to denote $\lambda_t$ and $\tilde{\lambda}_t$, respectively, in order to emphasize the role of $\theta$.
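The recursion for $\tilde{\lambda}_t$ is straightforward to implement. The sketch below uses the diagonal-$A$ setup with the parameter values of the first simulation design in Section 4.1; the three-observation series and the initial value are made up for illustration.

```python
def lambda_tilde_path(Y, omega, A_diag, B, lam1):
    """Recursion lambda_t = omega + A lambda_{t-1} + B Y_{t-1}, diagonal A.

    Y is a list of pairs (Y_{t,1}, Y_{t,2}); lam1 is the arbitrary initial value.
    """
    lams = [lam1]
    for y_prev in Y[:-1]:                 # lambda_t uses Y_{t-1}
        prev = lams[-1]
        lams.append((
            omega[0] + A_diag[0] * prev[0] + B[0][0] * y_prev[0] + B[0][1] * y_prev[1],
            omega[1] + A_diag[1] * prev[1] + B[1][0] * y_prev[0] + B[1][1] * y_prev[1],
        ))
    return lams

# Parameters from the first simulation design; the data are illustrative.
Y = [(2, 1), (0, 3), (4, 2)]
path = lambda_tilde_path(Y, omega=(1.0, 0.5), A_diag=(0.2, 0.3),
                         B=((0.1, 0.2), (0.4, 0.2)), lam1=(1.0, 1.0))
```

The effect of the arbitrary $\tilde{\lambda}_1$ decays geometrically because the recursion is a contraction under (A3), which is why the choice of initial value is asymptotically irrelevant.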

#### **3. Asymptotic Properties of the MDPDE**

In this section, we establish the consistency and asymptotic normality of the MDPDE. Throughout this study, $\|A\|_p$ denotes the $p$-induced norm of a matrix $A$ for $1 \le p \le \infty$ and $\|x\|_p$ is the $p$-norm of a vector $x$. For $p = 1$ and $\infty$, $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}|$ and $\|A\|_{\infty} = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}|$ for $A = \{a_{ij}\}_{1 \le i \le m, 1 \le j \le n}$, respectively. $E(\cdot)$ is taken under $\theta^0$. We assume that the following conditions hold in order to verify the asymptotic properties of the MDPDE.

**(A1)** $\theta_1^0$, $\theta_2^0$, and $\delta^0$ are interior points of the compact parameter spaces $\Theta_1$, $\Theta_2$, and $\Theta_3$, respectively, and $\Theta = \Theta_1 \times \Theta_2 \times \Theta_3$. In addition, there exist positive constants $\omega_L$, $\omega_U$, $a_L$, $a_U$, $b_L$, $b_U$, and $\delta_U$ such that for $i, j = 1, 2$,

$$0 < \omega_L \le \omega_i \le \omega_U, \quad 0 < a_L \le a_i \le a_U, \quad 0 < b_L \le b_{ij} \le b_U, \quad \text{and } |\delta| \le \delta_U.$$

**(A2)** There exist positive constants $\varphi_L$ and $\varphi_U$ such that for $y = (y_1, y_2)^T \in \mathbb{N}_0^2$, $\lambda = (\lambda_1, \lambda_2)^T \in (0, \infty)^2$, and $\delta \in \Theta_3$,

$$0 < \varphi_L \le \varphi(y, \lambda, \delta) \le \varphi_U, \quad \text{where } \varphi(y, \lambda, \delta) = 1 + \delta(e^{-y_1} - e^{-c\lambda_1})(e^{-y_2} - e^{-c\lambda_2}).$$

**(A3)** There exists $p \in [1, \infty]$ such that $\|A\|_p + 2^{1-1/p}\|B\|_p < 1$.
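Condition (A3) is easy to check numerically for a candidate parameter. A minimal sketch for the $2 \times 2$ case, using the first simulation design of Section 4.1 (which Section 4.1 states satisfies (A3) with $p = 1$) and a second, made-up matrix that violates it:

```python
def induced_norm(M, p):
    """1- and infinity-induced norms of a 2x2 matrix ((m11, m12), (m21, m22))."""
    if p == 1:   # maximum absolute column sum
        return max(abs(M[0][0]) + abs(M[1][0]), abs(M[0][1]) + abs(M[1][1]))
    return max(abs(M[0][0]) + abs(M[0][1]), abs(M[1][0]) + abs(M[1][1]))  # p = inf

def satisfies_A3(A, B, p):
    """Condition (A3): ||A||_p + 2^(1 - 1/p) ||B||_p < 1."""
    factor = 2.0 ** (1.0 - 1.0 / p)   # 1.0/inf == 0.0, so p = inf gives factor 2
    return induced_norm(A, p) + factor * induced_norm(B, p) < 1.0

# First simulation design: A = diag(0.2, 0.3), B = ((0.1, 0.2), (0.4, 0.2)).
A = ((0.2, 0.0), (0.0, 0.3))
B = ((0.1, 0.2), (0.4, 0.2))
ok_p1 = satisfies_A3(A, B, 1)                       # 0.3 + 1 * 0.5 = 0.8 < 1
bad = satisfies_A3(((0.6, 0.0), (0.0, 0.6)), B, 1)  # 0.6 + 0.5 = 1.1, fails
```

This also illustrates why (A3) is stated "there exists a $p$": a design may fail the check for one norm yet pass it for another.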

**Remark 1.** *These conditions can be found in Cui and Zhu [20]. According to Theorem 1 in their study,* {(*Yt*, *λt*)} *is stationary and ergodic under (A1) and (A3).*

Subsequently, we obtain the following results; the proofs are provided in the Appendix A.

**Theorem 1.** *Under the conditions (A1)–(A3),*

$$\hat{\theta}_{\alpha,n} \stackrel{a.s.}{\longrightarrow} \theta^0 \text{ as } n \to \infty.$$

**Theorem 2.** *Under the conditions (A1)–(A3),*

$$\sqrt{n}(\hat{\theta}_{\alpha,n} - \theta^0) \stackrel{d}{\longrightarrow} N(0, J_{\alpha}^{-1} K_{\alpha} J_{\alpha}^{-1}) \text{ as } n \to \infty,$$

*where*

$$J_{\alpha} = -E\left(\frac{\partial^2 h_{\alpha,t}(\theta^0)}{\partial \theta \partial \theta^T}\right), \quad K_{\alpha} = E\left(\frac{\partial h_{\alpha,t}(\theta^0)}{\partial \theta}\frac{\partial h_{\alpha,t}(\theta^0)}{\partial \theta^T}\right),$$

*and* $h_{\alpha,t}(\theta)$ *is defined by replacing* $\tilde{\lambda}_t(\theta)$ *with* $\lambda_t(\theta)$ *in (4).*

**Remark 2.** *Because the tuning parameter α controls the trade-off between robustness and asymptotic efficiency, choosing the optimal α is an important issue in practice. Several researchers have investigated selection criteria for the optimal α; see Fujisawa and Eguchi [33], Durio and Isaia [34], and Toma and Broniatowski [35]. Among them, we adopt the method of Warwick [36] and choose the α that minimizes the trace of the estimated asymptotic mean squared error (AMSE), defined by*

$$\widehat{AMSE} = (\hat{\theta}_{\alpha,n} - \hat{\theta}_{1,n})(\hat{\theta}_{\alpha,n} - \hat{\theta}_{1,n})^T + \widehat{As.var}(\hat{\theta}_{\alpha,n}),$$

*where* $\hat{\theta}_{1,n}$ *is the MDPDE with* $\alpha = 1$ *and* $\widehat{As.var}(\hat{\theta}_{\alpha,n})$ *is an estimate of the asymptotic variance of* $\hat{\theta}_{\alpha,n}$*, computed as*

$$\widehat{As.var}(\hat{\theta}_{\alpha,n}) = \left(\sum_{t=1}^{n} \frac{\partial^2 \tilde{h}_{\alpha,t}(\hat{\theta}_{\alpha,n})}{\partial \theta \partial \theta^T}\right)^{-1} \left(\sum_{t=1}^{n} \frac{\partial \tilde{h}_{\alpha,t}(\hat{\theta}_{\alpha,n})}{\partial \theta} \frac{\partial \tilde{h}_{\alpha,t}(\hat{\theta}_{\alpha,n})}{\partial \theta^T}\right) \left(\sum_{t=1}^{n} \frac{\partial^2 \tilde{h}_{\alpha,t}(\hat{\theta}_{\alpha,n})}{\partial \theta \partial \theta^T}\right)^{-1}.$$

*This criterion is applied to our empirical study in Section 4.2.*
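Once the candidate estimates and their variance estimates are available, Warwick's rule reduces to evaluating the trace of the estimated AMSE at each candidate $\alpha$ and taking the minimizer. A minimal sketch; the numerical estimates and variance traces below are hypothetical stand-ins for fitted values, not results from the paper.

```python
def amse_trace(theta_a, theta_1, asvar_trace):
    """Trace of the estimated AMSE: squared-bias term plus variance term.

    trace((t_a - t_1)(t_a - t_1)^T) equals ||t_a - t_1||^2, so only the
    trace of the estimated asymptotic variance is needed on top of it.
    """
    bias2 = sum((a - b) ** 2 for a, b in zip(theta_a, theta_1))
    return bias2 + asvar_trace

def select_alpha(candidates):
    """candidates: dict alpha -> (theta_hat, trace of estimated As.var).

    Warwick's rule uses the MDPDE at alpha = 1 as the pilot estimate.
    """
    theta_1 = candidates[1.0][0]
    return min(candidates,
               key=lambda a: amse_trace(candidates[a][0], theta_1, candidates[a][1]))

# Hypothetical (theta_hat, variance-trace) pairs for three tuning parameters.
candidates = {
    0.1: ((1.02, 0.21), 0.030),
    0.5: ((0.97, 0.24), 0.045),
    1.0: ((0.95, 0.25), 0.070),
}
best = select_alpha(candidates)
```

With these illustrative numbers the criterion picks the smallest candidate, mirroring the empirical finding in Section 4.2 that a small $\alpha$ is often optimal for mildly contaminated data.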

#### **4. Empirical Studies**

#### *4.1. Simulation*

In this section, we report simulation results to evaluate the performance of the MDPDE. The simulation settings are as follows. Using the inverse transform sampling method (cf. Section 2.3 of Verges [37]), we generate $Y_1, \ldots, Y_n$ from (2) with the initial value $\lambda_1 = (0, 0)^T$. For the estimation, $\tilde{\lambda}_1$ is set to the sample mean of the data. We first consider $\theta = (\omega_1, a_1, b_{11}, b_{12}, \omega_2, a_2, b_{21}, b_{22}, \delta)^T = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)^T$, which satisfies **(A3)** with $p = 1$. In this simulation, we compare the performance of the MDPDE with $\alpha > 0$ with that of the CMLE ($\alpha = 0$). We examine the sample mean, variance, and mean squared error (MSE) of the estimators. The sample size under consideration is $n = 1000$ and the number of repetitions for each simulation is 1000. In Tables 1–16, the symbol \* represents the minimal MSE for each parameter.

Table 1 indicates that, when the data are not contaminated by outliers, the CMLE exhibits minimal MSEs for all parameters, and the MSEs of the MDPDE with small *α* are close to those of the CMLE. The MSE of the MDPDE shows an increasing tendency as *α* increases. Hence, we can conclude that the CMLE outperforms the MDPDE in the absence of outliers.

**Table 1.** Sample mean, variance, and mean squared error (MSE) of estimators when *θ* = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)*T*, *n* = 1000, and no outliers exist.


Now, we consider the situation where the data are contaminated by outliers. To this end, we generate contaminated data $Y_{c,t} = (Y_{c,t,1}, Y_{c,t,2})^T$ by considering

$$Y_{c,t,i} = Y_{t,i} + P_{t,i} Y_{o,t,i}, \quad i = 1, 2,$$

where *Yt*,*<sup>i</sup>* are generated from (2), *Pt*,*<sup>i</sup>* are i.i.d. Bernoulli random variables with success probability *p*, and *Yo*,*t*,*<sup>i</sup>* are i.i.d. Poisson random variables with mean *γ*. We consider three cases: (*p*, *γ*)=(0.03, 5), (0.03, 10), and (0.05, 10). Tables 2–4 report the results. In the tables, the MDPDE appears to have smaller MSEs than the CMLE for all cases, except for the case of *α* = 1 when (*p*, *γ*)=(0.03, 5). As *p* or *γ* increases, the MSEs of the CMLE increase faster than those of the MDPDE, which indicates that the MDPDE outperforms the CMLE, as the data are more contaminated by outliers. Moreover, as *p* or *γ* increases, the symbol ∗ tends to move downward. This indicates that, when the data are severely contaminated by outliers, the MDPDE with large *α* performs better.
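This contamination mechanism is simple to reproduce. A sketch, with a constant placeholder series standing in for draws from (2); the seed and series values are illustrative only.

```python
import math
import random

def rpois(lam, rng):
    # Poisson draw by inversion of the CDF (adequate for moderate lam)
    u, p, k = rng.random(), math.exp(-lam), 0
    c = p
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

def contaminate(Y, p, gamma, rng):
    # Additive outliers: Y_c = Y + P * Y_o with P ~ Bernoulli(p) and
    # Y_o ~ Poisson(gamma), applied independently to each component.
    out = []
    for y1, y2 in Y:
        o1 = rpois(gamma, rng) if rng.random() < p else 0
        o2 = rpois(gamma, rng) if rng.random() < p else 0
        out.append((y1 + o1, y2 + o2))
    return out

rng = random.Random(42)
clean = [(2, 1)] * 10000        # placeholder in-control series
dirty = contaminate(clean, p=0.05, gamma=10, rng=rng)
frac_shifted = sum(d != c for d, c in zip(dirty, clean)) / len(clean)
```

Because the outliers are additive and non-negative, they only inflate the counts; with $p = 0.05$ roughly one pair in ten is perturbed in at least one component, which is the severest setting in the tables.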

**Table 2.** Sample mean, variance, and MSE of estimators when *θ* = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)*T*, *n* = 1000, and (*p*, *γ*)=(0.03, 5).


**Table 3.** Sample mean, variance, and MSE of estimators when *θ* = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)*T*, *n* = 1000, and (*p*, *γ*)=(0.03, 10).


**Table 4.** Sample mean, variance, and MSE of estimators when *θ* = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)*T*, *n* = 1000, and (*p*, *γ*)=(0.05, 10).


We also consider a smaller sample size, *n* = 200. The results, presented in Tables 5–8, are similar to those in Tables 1–4, although the variances and MSEs of both the CMLE and the MDPDE are larger.

**Table 5.** Sample mean, variance, and MSE of estimators when *θ* = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)*T*, *n* = 200, and no outliers exist.


**Table 6.** Sample mean, variance and MSE of estimators when *θ* = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)*T*, *n* = 200, and (*p*, *γ*)=(0.03, 5).


**Table 7.** Sample mean, variance, and MSE of estimators when *θ* = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)*T*, *n* = 200, and (*p*, *γ*)=(0.03, 10).



**Table 8.** Sample mean, variance, and MSE of estimators when *θ* = (1, 0.2, 0.1, 0.2, 0.5, 0.3, 0.4, 0.2, 0.5)*T*, *n* = 200, and (*p*, *γ*)=(0.05, 10).

In order to evaluate the performance of the MDPDE for negatively cross-correlated data, we consider $\theta = (\omega_1, a_1, b_{11}, b_{12}, \omega_2, a_2, b_{21}, b_{22}, \delta)^T = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, -0.4)^T$ with the same $p$ and $\gamma$ as above. The results are reported in Tables 9–16 for $n = 1000$ and $n = 200$, respectively, and are similar to those in Tables 1–8. Overall, our findings strongly support the assertion that the MDPDE is a useful tool for robust estimation in bivariate Poisson INGARCH models in the presence of outliers.

**Table 9.** Sample mean, variance, and MSE of estimators when *<sup>θ</sup>* = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, <sup>−</sup>0.4)*T*, *<sup>n</sup>* <sup>=</sup> 1000, and no outliers exist.


**Table 10.** Sample mean, variance, and MSE of estimators when *<sup>θ</sup>* = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, <sup>−</sup>0.4)*T*, *<sup>n</sup>* <sup>=</sup> 1000, and (*p*, *<sup>γ</sup>*)=(0.03, 5).



**Table 11.** Sample mean, variance, and MSE of estimators when *<sup>θ</sup>* = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, <sup>−</sup>0.4)*T*, *<sup>n</sup>* <sup>=</sup> 1000, and (*p*, *<sup>γ</sup>*) = (0.03, 10).

**Table 12.** Sample mean, variance, and MSE of estimators when *<sup>θ</sup>* = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, <sup>−</sup>0.4)*T*, *<sup>n</sup>* <sup>=</sup> 1000, and (*p*, *<sup>γ</sup>*) = (0.05, 10).


**Table 13.** Sample mean, variance, and MSE of estimators when *<sup>θ</sup>* = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, <sup>−</sup>0.4)*T*, *<sup>n</sup>* <sup>=</sup> 200, and no outliers exist.



**Table 14.** Sample mean, variance, and MSE of estimators when *<sup>θ</sup>* = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, <sup>−</sup>0.4)*T*, *<sup>n</sup>* <sup>=</sup> 200, and (*p*, *<sup>γ</sup>*)=(0.03, 5).

**Table 15.** Sample mean, variance, and MSE of estimators when *<sup>θ</sup>* = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, <sup>−</sup>0.4)*T*, *<sup>n</sup>* <sup>=</sup> 200, and (*p*, *<sup>γ</sup>*)=(0.03, 10).


**Table 16.** Sample mean, variance, and MSE of estimators when *<sup>θ</sup>* = (0.5, 0.1, 0.2, 0.4, 0.3, 0.3, 0.2, 0.1, <sup>−</sup>0.4)*T*, *<sup>n</sup>* <sup>=</sup> 200, and (*p*, *<sup>γ</sup>*)=(0.05, 10).


#### *4.2. Illustrative Examples*

First, we illustrate the proposed method by examining the monthly count series of crimes provided by the New South Wales Police Force in Australia. The data set is classified by local government area and offence type and has been studied in many works, including Lee et al. [9], Chen and Lee [38,39], Kim and Lee [24], and Lee et al. [40]. To investigate the behavior of the MDPDE in the presence of outliers, we consider the series of liquor offences (LO) and transport regulatory offences (TRO) in Botany Bay from January 1995 to December 2012, with 216 observations in each series. Figure 1 plots the monthly count series of LO and TRO and shows the presence of some deviant observations in each series. The sample mean and variance are 1.912 and 13.14 for LO, and 2.463 and 20.41 for TRO; the large variance of each series is likely influenced by outliers. The autocorrelation function (ACF) and partial autocorrelation function (PACF) of LO and TRO, as well as the cross-correlation function (CCF) between the two series, are given in Figure 2, indicating that the data are both serially and cross-correlated. The cross-correlation coefficient between the two series is 0.141.

**Figure 1.** Monthly count series of liquor offences (LO) (**left**) and transport regulatory offences (TRO) (**right**) in Botany Bay.

**Figure 2.** Autocorrelation function (ACF) and partial autocorrelation function (PACF) of LO (**top**) and TRO (**middle**), and cross-correlation function (CCF) (**bottom**) between two series.

We fit the model (2) to the data using both the CMLE and the MDPDE; $\tilde{\lambda}_1$ is set to the sample mean of the data. Table 17 reports the estimated parameters for various $\alpha$. The standard errors are given in parentheses and the symbol • represents the minimal AMSE defined in Remark 2. In the table, we observe that the MDPDE has a smaller AMSE than the CMLE and that the optimal $\alpha$ is chosen to be 0.1. The MDPDE with the optimal $\alpha$ is quite different from the CMLE; in particular, $\hat{\delta}$ is about half of the CMLE value. This result indicates that outliers can seriously affect the parameter estimation and, thus, a robust estimation method is required when the data are contaminated by outliers.


**Table 17.** Parameter estimates for bivariate Poisson integer-valued generalized autoregressive conditional heteroscedastic (INGARCH) model for crime data; the symbol • represents the minimal AMSE.

To illustrate the behavior of the estimators in the absence of outliers, we clean the data using the approach introduced by Fokianos and Fried [41] and apply the CMLE and the MDPDE to the cleaned data. Table 18 reports the results. The standard errors and AMSE tend to decrease compared with Table 17. The CMLE has the minimal AMSE, and the MDPDE with small $\alpha$ appears similar to the CMLE.

**Table 18.** Parameter estimates for bivariate Poisson INGARCH model for cleaned data; the symbol • represents the minimal AMSE.


Now, we consider an artificial example with a negative cross-correlation coefficient. Following Cui and Zhu [20], we generate 1000 samples from the univariate Poisson INGARCH model, i.e.,

$$X_t | \mathcal{F}_{t-1} \sim P(\lambda_t), \quad \lambda_t = 1 + 0.35 \lambda_{t-1} + 0.45 X_{t-1},$$

where $P(\lambda_t)$ denotes the Poisson distribution with mean $\lambda_t$. Further, we observe the contaminated data $X_{c,t}$ as follows:

$$X_{c,t} = X_t + P_t X_{o,t},$$

where $P_t$ are i.i.d. Bernoulli random variables with success probability 0.03 and $X_{o,t}$ are i.i.d. Poisson random variables with mean 5. Let $Y_t = (Y_{t,1}, Y_{t,2})^T$, where $Y_{t,1} = X_{c,t}$ and $Y_{t,2} = X_{c,t+500}$ for $t = 1, \ldots, 500$. The sample mean and variance are 5.196 and 7.380 for $Y_{t,1}$, and 4.538 and 8.129 for $Y_{t,2}$. The cross-correlation coefficient between $Y_{t,1}$ and $Y_{t,2}$ is $-0.161$. We fit the model (2) to $Y_t$; the results are presented in Table 19. Similar to Table 17, the MDPDE has a smaller AMSE than the CMLE. The optimal $\alpha$ is chosen to be 0.3 and the corresponding $\hat{\delta}$ is $-0.329$, whereas the CMLE gives 0.772.
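The univariate Poisson INGARCH generator behind this artificial example can be sketched as follows; the random seed and the initialization of $\lambda_1$ at the stationary mean $\omega/(1-a-b)$ are illustrative choices, not specified in the text.

```python
import math
import random

def rpois(lam, rng):
    # Poisson draw by inversion of the CDF (adequate for moderate lam)
    u, p, k = rng.random(), math.exp(-lam), 0
    c = p
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

def simulate_ingarch(n, rng, omega=1.0, a=0.35, b=0.45):
    """Univariate Poisson INGARCH: X_t | F_{t-1} ~ Poisson(lam_t),
    lam_t = omega + a * lam_{t-1} + b * X_{t-1}."""
    lam = omega / (1.0 - a - b)   # stationary mean of lam_t (here 5)
    xs = [rpois(lam, rng)]
    for _ in range(n - 1):
        lam = omega + a * lam + b * xs[-1]
        xs.append(rpois(lam, rng))
    return xs

rng = random.Random(7)
xs = simulate_ingarch(1000, rng)
mean_x = sum(xs) / len(xs)
```

Pairing the contaminated first half against the second half, as in the text, yields two series that are marginally alike yet carry no built-in cross-dependence, so any fitted $\delta$ mostly reflects the sample cross-correlation.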

**Table 19.** Parameter estimates for bivariate Poisson INGARCH model for artificial data; the symbol • represents the minimal AMSE.


#### **5. Concluding Remarks**

In this study, we developed a robust estimator for bivariate Poisson INGARCH models based on the MDPDE. This is important in practice because outliers can strongly affect the CMLE, which is commonly employed for parameter estimation in INGARCH models. We proved that the MDPDE is consistent and asymptotically normal under regularity conditions. Our simulation study compared the performance of the MDPDE with that of the CMLE and confirmed the superiority of the proposed estimator in the presence of outliers. The real data analysis also confirmed the validity of our method as a robust estimator in practice. Although we focused on Cui and Zhu's [20] bivariate Poisson INGARCH models here, the MDPDE method can be extended to other bivariate INGARCH models; for example, one can consider the copula-based bivariate INGARCH models studied by Heinen and Rengifo [42], Andreassen [18], and Fokianos et al. [43]. We leave this extension for future research.

**Author Contributions:** Conceptualization, B.K. and S.L.; methodology, B.K., S.L. and D.K.; software, B.K. and D.K.; formal analysis, B.K., S.L. and D.K.; data curation, B.K.; writing—original draft preparation, B.K. and S.L.; writing—review and editing, B.K. and S.L.; funding acquisition, B.K. and S.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT), grant no. NRF-2019R1C1C1004662 (B.K.) and 2021R1A2C1004009 (S.L.).

**Data Availability Statement:** Publicly available datasets were analyzed in this study. These data can be found at data.gov.au (accessed on 19 March 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

In this Appendix, we provide the proofs of Theorems 1 and 2 in Section 3 for the case $\alpha > 0$; we refer to Cui and Zhu [20] for the case $\alpha = 0$. In what follows, $V$ and $\rho \in (0,1)$ denote a generic positive integrable random variable and a generic constant, respectively, and $H_{\alpha,n}(\theta) = n^{-1}\sum_{t=1}^{n} h_{\alpha,t}(\theta)$. Furthermore, we employ the notation $\lambda_t = \lambda_t(\theta)$, $\tilde\lambda_t = \tilde\lambda_t(\theta)$, $\lambda_t^0 = \lambda_t(\theta^0)$, $\lambda_{t,i} = \lambda_{t,i}(\theta_i)$, $\tilde\lambda_{t,i} = \tilde\lambda_{t,i}(\theta_i)$, $\lambda_{t,i}^0 = \lambda_{t,i}(\theta_i^0)$ for $i = 1, 2$, and $u(y,\lambda) = e^{-y} - e^{-c\lambda}$ for brevity.

**Lemma A1.** *Under (A1)–(A3), we have for i* = 1, 2*,*

*(i)* $E(\sup_{\theta_i\in\Theta_i} \lambda_{t,i}) < \infty$.
*(ii)* $\lambda_{t,i} = \lambda_{t,i}^0$ *a.s. implies* $\theta_i = \theta_i^0$.
*(iii)* $\lambda_{t,i}$ *is twice continuously differentiable with respect to* $\theta_i$ *and satisfies*

$$E\left(\sup_{\theta_i\in\Theta_i}\left\|\frac{\partial\lambda_{t,i}}{\partial\theta_i}\right\|_1\right)^4 < \infty \quad\text{and}\quad E\left(\sup_{\theta_i\in\Theta_i}\left\|\frac{\partial^2\lambda_{t,i}}{\partial\theta_i\partial\theta_i^T}\right\|_1\right)^2 < \infty.$$

*(iv)* $\sup_{\theta_i\in\Theta_i}\left\|\frac{\partial\lambda_{t,i}}{\partial\theta_i} - \frac{\partial\tilde\lambda_{t,i}}{\partial\theta_i}\right\|_1 \le V\rho^t$ *a.s.*
*(v)* $\nu^T \frac{\partial\lambda_{t,i}^0}{\partial\theta_i} = 0$ *a.s. implies* $\nu = 0$.
*(vi)* $\sup_{\theta_i\in\Theta_i}|\lambda_{t,i} - \tilde\lambda_{t,i}| \le V\rho^t$ *a.s.*

**Proof.** By recursion of (2), we have

$$\begin{aligned} \lambda\_t &= (I\_2 - A)^{-1} \omega + \sum\_{k=0}^{\infty} A^k \mathcal{B} Y\_{t-k-1}, \\ \tilde{\lambda}\_t &= (I\_2 + A + \dots + A^{t-2}) \omega + A^{t-1} \tilde{\lambda}\_1 + \sum\_{k=0}^{t-2} A^k \mathcal{B} Y\_{t-k-1} \end{aligned}$$

and thus, for *i* = 1, 2,

$$\begin{aligned}
\lambda_{t,i} &= \frac{\omega_i}{1-a_i} + \sum_{k=0}^{\infty} a_i^k (b_{i1} Y_{t-k-1,1} + b_{i2} Y_{t-k-1,2}), \\
\tilde\lambda_{t,i} &= \frac{\omega_i}{1-a_i} + \sum_{k=0}^{t-2} a_i^k (b_{i1} Y_{t-k-1,1} + b_{i2} Y_{t-k-1,2}),
\end{aligned}$$

where $I_2$ denotes the $2\times 2$ identity matrix and the initial value $\tilde\lambda_{1,i}$ is taken as $\omega_i/(1-a_i)$ for simplicity. Then, $(i)$–$(v)$ can be shown by following the arguments in the proof of Theorem 3 in Kang and Lee [44]. For $(vi)$, let $\rho = \sup_{\theta_i\in\Theta_i} a_i < 1$. Then, from (2), it holds that

$$\sup_{\theta_i\in\Theta_i}|\lambda_{t,i}-\tilde\lambda_{t,i}| = \sup_{\theta_i\in\Theta_i}\left|a_i(\lambda_{t-1,i}-\tilde\lambda_{t-1,i})\right| = \cdots = \sup_{\theta_i\in\Theta_i}\left|a_i^{t-1}(\lambda_{1,i}-\tilde\lambda_{1,i})\right| \le V\rho^t$$

with $V = \sup_{\theta_i\in\Theta_i}|\lambda_{1,i}-\tilde\lambda_{1,i}|/\rho$. Therefore, the lemma is established.
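The geometric decay in $(vi)$ is easy to see numerically: two copies of the intensity recursion driven by the same counts but started from different initial values differ by exactly a factor $a_i$ per step. A minimal sketch for one component (the parameter values and the simulated counts are illustrative, not taken from the paper):

```python
# Numerical illustration of Lemma A1 (vi): two intensity recursions
# lambda_t = omega + a*lambda_{t-1} + b*Y_{t-1}, driven by the same counts
# but started from different initial values, contract by a factor a each
# step, so |lambda_t - lambda_tilde_t| = a^t * |lambda_0 - lambda_tilde_0|.
import random

omega, a, b = 1.0, 0.4, 0.3          # illustrative parameter values
rng = random.Random(0)
Y = [rng.randint(0, 10) for _ in range(50)]  # stand-in observed counts

lam, lam_tilde = 9.0, omega / (1 - a)  # "true" vs. convenient initial value
diffs = []
for t in range(50):
    diffs.append(abs(lam - lam_tilde))
    lam = omega + a * lam + b * Y[t]
    lam_tilde = omega + a * lam_tilde + b * Y[t]

# geometric decay: diff_t = a^t * diff_0, up to rounding
for t, d in enumerate(diffs):
    assert abs(d - (a ** t) * diffs[0]) < 1e-9
```

The observed counts cancel in the difference, which is why the bound $V\rho^t$ does not depend on the data beyond the initial gap.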

**Lemma A2.** *Under (A1)–(A3), we have*

$$\sup_{\theta\in\Theta}\left|H_{\alpha,n}(\theta) - \tilde H_{\alpha,n}(\theta)\right| \stackrel{a.s.}{\longrightarrow} 0 \text{ as } n \to \infty.$$

**Proof.** It is sufficient to show that

$$\sup_{\theta\in\Theta}\left|h_{\alpha,t}(\theta) - \tilde h_{\alpha,t}(\theta)\right| \stackrel{a.s.}{\longrightarrow} 0 \text{ as } t \to \infty.$$

We write

$$\begin{aligned}
\left|h_{\alpha,t}(\theta) - \tilde h_{\alpha,t}(\theta)\right| &\le \left|\sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty}\left\{f_\theta^{1+\alpha}(y|\lambda_t) - f_\theta^{1+\alpha}(y|\tilde\lambda_t)\right\}\right| + \left(1+\frac{1}{\alpha}\right)\left|f_\theta^{\alpha}(Y_t|\lambda_t) - f_\theta^{\alpha}(Y_t|\tilde\lambda_t)\right| \\
&:= I_t(\theta) + II_t(\theta).
\end{aligned}$$

From **(A1)**, **(A2)**, and the mean value theorem (MVT), we have

$$\begin{aligned}
I_t(\theta) &= (1+\alpha)\left|\sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t^*)\left\{\frac{y_1}{\lambda_{t,1}^*} - 1 + \frac{c\delta e^{-c\lambda_{t,1}^*}u(y_2,\lambda_{t,2}^*)}{\varphi(y,\lambda_t^*,\delta)}\right\}(\lambda_{t,1}-\tilde\lambda_{t,1})\right. \\
&\qquad \left. + \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t^*)\left\{\frac{y_2}{\lambda_{t,2}^*} - 1 + \frac{c\delta e^{-c\lambda_{t,2}^*}u(y_1,\lambda_{t,1}^*)}{\varphi(y,\lambda_t^*,\delta)}\right\}(\lambda_{t,2}-\tilde\lambda_{t,2})\right| \\
&\le (1+\alpha)\left[\sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta(y|\lambda_t^*)\left\{\frac{y_1}{\lambda_{t,1}^*} + 1 + \frac{c|\delta| e^{-c\lambda_{t,1}^*}|u(y_2,\lambda_{t,2}^*)|}{\varphi(y,\lambda_t^*,\delta)}\right\}|\lambda_{t,1}-\tilde\lambda_{t,1}|\right. \\
&\qquad \left. + \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta(y|\lambda_t^*)\left\{\frac{y_2}{\lambda_{t,2}^*} + 1 + \frac{c|\delta| e^{-c\lambda_{t,2}^*}|u(y_1,\lambda_{t,1}^*)|}{\varphi(y,\lambda_t^*,\delta)}\right\}|\lambda_{t,2}-\tilde\lambda_{t,2}|\right] \\
&\le (1+\alpha)\left\{\left(1 + 1 + \frac{2c\delta_U}{\varphi_L}\right)|\lambda_{t,1}-\tilde\lambda_{t,1}| + \left(1 + 1 + \frac{2c\delta_U}{\varphi_L}\right)|\lambda_{t,2}-\tilde\lambda_{t,2}|\right\} \\
&= 2(1+\alpha)\left(1 + \frac{c\delta_U}{\varphi_L}\right)\left(|\lambda_{t,1}-\tilde\lambda_{t,1}| + |\lambda_{t,2}-\tilde\lambda_{t,2}|\right),
\end{aligned}$$

where $\lambda_t^* = (\lambda_{t,1}^*, \lambda_{t,2}^*)^T$ and $\lambda_{t,i}^*$ is an intermediate point between $\lambda_{t,i}$ and $\tilde\lambda_{t,i}$ for $i = 1, 2$. Hence, $\sup_{\theta\in\Theta} I_t(\theta)$ converges to 0 a.s. as $t\to\infty$ by $(vi)$ of Lemma A1.

Because $\lambda_{t,i}^* = m\lambda_{t,i} + (1-m)\tilde\lambda_{t,i}$ for some $m\in(0,1)$, it holds that $(\lambda_{t,i}^*)^{-1} \le (m\lambda_{t,i})^{-1} \le (m\omega_L)^{-1}$. Thus, we obtain

$$\begin{aligned}
II_t(\theta) &= (1+\alpha)\left|f_\theta^{\alpha}(Y_t|\lambda_t^*)\left\{\frac{Y_{t,1}}{\lambda_{t,1}^*} - 1 + \frac{c\delta e^{-c\lambda_{t,1}^*}u(Y_{t,2},\lambda_{t,2}^*)}{\varphi(Y_t,\lambda_t^*,\delta)}\right\}(\lambda_{t,1}-\tilde\lambda_{t,1})\right. \\
&\qquad \left. + f_\theta^{\alpha}(Y_t|\lambda_t^*)\left\{\frac{Y_{t,2}}{\lambda_{t,2}^*} - 1 + \frac{c\delta e^{-c\lambda_{t,2}^*}u(Y_{t,1},\lambda_{t,1}^*)}{\varphi(Y_t,\lambda_t^*,\delta)}\right\}(\lambda_{t,2}-\tilde\lambda_{t,2})\right| \\
&\le (1+\alpha)\left\{\left(\frac{Y_{t,1}}{m\omega_L} + 1 + \frac{2c\delta_U}{\varphi_L}\right)|\lambda_{t,1}-\tilde\lambda_{t,1}| + \left(\frac{Y_{t,2}}{m\omega_L} + 1 + \frac{2c\delta_U}{\varphi_L}\right)|\lambda_{t,2}-\tilde\lambda_{t,2}|\right\}.
\end{aligned}$$

According to Lemma 2.1 in Straumann and Mikosch [45], together with $(vi)$ of Lemma A1, $\sup_{\theta\in\Theta} II_t(\theta)$ converges to 0 a.s. as $t\to\infty$. Therefore, the lemma is verified.

**Lemma A3.** *Under (A1)–(A3), we have*

$$E\left(\sup_{\theta\in\Theta}|h_{\alpha,t}(\theta)|\right) < \infty \quad\text{and, if } \theta \neq \theta^0, \text{ then } E(h_{\alpha,t}(\theta)) > E(h_{\alpha,t}(\theta^0)).$$

**Proof.** Because

$$|h_{\alpha,t}(\theta)| \le \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t) + \left(1+\frac{1}{\alpha}\right)f_\theta^{\alpha}(Y_t|\lambda_t) \le 2 + \frac{1}{\alpha},$$

the first part of the lemma is validated. Note that

$$\begin{aligned}
E(h_{\alpha,t}(\theta)) - E(h_{\alpha,t}(\theta^0)) &= E\left[E\left\{h_{\alpha,t}(\theta) - h_{\alpha,t}(\theta^0)\,\middle|\,\mathcal{F}_{t-1}\right\}\right] \\
&= E\left[\sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty}\left\{f_\theta^{1+\alpha}(y|\lambda_t) - \left(1+\frac{1}{\alpha}\right)f_\theta^{\alpha}(y|\lambda_t)f_{\theta^0}(y|\lambda_t^0) + \frac{1}{\alpha}f_{\theta^0}^{1+\alpha}(y|\lambda_t^0)\right\}\right] \\
&\ge 0,
\end{aligned}$$

where the equality holds if and only if $\delta = \delta^0$ and $\lambda_t = \lambda_t^0$ a.s. Therefore, the second part of the lemma is established by $(ii)$ of Lemma A1.
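The nonnegativity of the summand above is the pointwise inequality $a^{1+\alpha} - (1+1/\alpha)a^{\alpha}b + (1/\alpha)b^{1+\alpha} \ge 0$ for $a, b \ge 0$ (a consequence of Young's inequality), with equality iff $a = b$, applied with $a = f_\theta$ and $b = f_{\theta^0}$. A quick numerical check, using univariate Poisson pmfs purely as stand-ins for the bivariate densities:

```python
# Check the pointwise inequality behind Lemma A3:
#   a^(1+alpha) - (1 + 1/alpha)*a^alpha*b + (1/alpha)*b^(1+alpha) >= 0,
# term by term, with univariate Poisson pmfs as illustrative a and b.
import math

def pois_pmf(y, lam):
    return math.exp(-lam) * lam**y / math.factorial(y)

alpha = 0.3
total = 0.0
for y in range(60):
    a = pois_pmf(y, 4.0)   # stand-in for f_theta
    b = pois_pmf(y, 5.0)   # stand-in for f_theta0
    term = a**(1 + alpha) - (1 + 1/alpha) * a**alpha * b + (1/alpha) * b**(1 + alpha)
    assert term >= -1e-15  # each summand is nonnegative (up to rounding)
    total += term

# the sum is strictly positive because the two pmfs differ
assert total > 0
```

With $a = b$ every term vanishes, which is exactly why the expectation difference is zero only at $\theta = \theta^0$.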

**Proof of Theorem 1.** We can write

$$\begin{aligned}
&\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^{n}\tilde h_{\alpha,t}(\theta) - E(h_{\alpha,t}(\theta))\right| \\
&\le \sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^{n}\tilde h_{\alpha,t}(\theta) - \frac{1}{n}\sum_{t=1}^{n}h_{\alpha,t}(\theta)\right| + \sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^{n}h_{\alpha,t}(\theta) - E(h_{\alpha,t}(\theta))\right|.
\end{aligned}$$

The first term on the RHS of the inequality converges to 0 a.s. by Lemma A2. Moreover, because $h_{\alpha,t}(\theta)$ is stationary and ergodic with $E(\sup_{\theta\in\Theta}|h_{\alpha,t}(\theta)|) < \infty$ by Lemma A3, the second term also converges to 0 a.s. Finally, as $E(h_{\alpha,t}(\theta))$ has a unique minimum at $\theta^0$ by Lemma A3, the theorem is established.

Now, we derive the first and second derivatives of *hα*,*t*(*θ*). The first derivatives are obtained as

$$\begin{aligned}
\frac{\partial h_{\alpha,t}(\theta)}{\partial\theta} &= (1+\alpha)\Big(D_{t,1}(\theta)s_{t,1}(\theta_1)^T,\; D_{t,2}(\theta)s_{t,2}(\theta_2)^T,\; D_{t,3}(\theta)\Big)^T \\
&= (1+\alpha)\begin{pmatrix} D_{t,1}(\theta)I_4 & \mathbf{0}_{4\times4} & \mathbf{0}_{4\times1} \\ \mathbf{0}_{4\times4} & D_{t,2}(\theta)I_4 & \mathbf{0}_{4\times1} \\ \mathbf{0}_{1\times4} & \mathbf{0}_{1\times4} & D_{t,3}(\theta)\end{pmatrix}\begin{pmatrix} s_{t,1}(\theta_1) \\ s_{t,2}(\theta_2) \\ 1 \end{pmatrix} \\
&:= (1+\alpha)D_t(\theta)\Lambda_t(\theta),
\end{aligned}$$

where $I_4$ denotes the $4\times4$ identity matrix, $\mathbf{0}_{m\times n}$ the $m\times n$ zero matrix, and

$$\begin{aligned}
s_{t,i}(\theta_i) ={}& \frac{\partial\lambda_{t,i}}{\partial\theta_i} \quad\text{for } i = 1,2, \\
D_{t,i}(\theta) ={}& \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t)\left\{\frac{y_i}{\lambda_{t,i}} - 1 + \frac{c\delta e^{-c\lambda_{t,i}}u(y_j,\lambda_{t,j})}{\varphi(y,\lambda_t,\delta)}\right\} \\
& - f_\theta^{\alpha}(Y_t|\lambda_t)\left\{\frac{Y_{t,i}}{\lambda_{t,i}} - 1 + \frac{c\delta e^{-c\lambda_{t,i}}u(Y_{t,j},\lambda_{t,j})}{\varphi(Y_t,\lambda_t,\delta)}\right\} \quad\text{for } (i,j)=(1,2),(2,1), \\
D_{t,3}(\theta) ={}& \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t)\frac{u(y_1,\lambda_{t,1})u(y_2,\lambda_{t,2})}{\varphi(y,\lambda_t,\delta)} - f_\theta^{\alpha}(Y_t|\lambda_t)\frac{u(Y_{t,1},\lambda_{t,1})u(Y_{t,2},\lambda_{t,2})}{\varphi(Y_t,\lambda_t,\delta)}.
\end{aligned}$$

The second derivatives are expressed as

$$\begin{aligned}
\frac{\partial^2 h_{\alpha,t}(\theta)}{\partial\theta\partial\theta^T} &= (1+\alpha)\begin{pmatrix} F_{t,11}(\theta)s_{t,1}(\theta_1)s_{t,1}(\theta_1)^T & F_{t,12}(\theta)s_{t,1}(\theta_1)s_{t,2}(\theta_2)^T & F_{t,13}(\theta)s_{t,1}(\theta_1) \\ F_{t,21}(\theta)s_{t,2}(\theta_2)s_{t,1}(\theta_1)^T & F_{t,22}(\theta)s_{t,2}(\theta_2)s_{t,2}(\theta_2)^T & F_{t,23}(\theta)s_{t,2}(\theta_2) \\ F_{t,31}(\theta)s_{t,1}(\theta_1)^T & F_{t,32}(\theta)s_{t,2}(\theta_2)^T & F_{t,33}(\theta)\end{pmatrix} \\
&\quad + (1+\alpha)\begin{pmatrix} D_{t,1}(\theta)s_{t,11}(\theta_1) & \mathbf{0}_{4\times4} & \mathbf{0}_{4\times1} \\ \mathbf{0}_{4\times4} & D_{t,2}(\theta)s_{t,22}(\theta_2) & \mathbf{0}_{4\times1} \\ \mathbf{0}_{1\times4} & \mathbf{0}_{1\times4} & 0 \end{pmatrix} \\
&:= (1+\alpha)\left\{F_t(\theta) + D_t(\theta)\frac{\partial\Lambda_t(\theta)}{\partial\theta^T}\right\},
\end{aligned}$$

where

$$\begin{aligned}
s_{t,ii}(\theta_i) ={}& \frac{\partial^2\lambda_{t,i}}{\partial\theta_i\partial\theta_i^T} \quad\text{for } i = 1,2, \\
F_{t,ii}(\theta) ={}& \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t)\left[(1+\alpha)\left\{\frac{y_i}{\lambda_{t,i}} - 1 + \frac{c\delta e^{-c\lambda_{t,i}}u(y_j,\lambda_{t,j})}{\varphi(y,\lambda_t,\delta)}\right\}^2 - \frac{y_i}{\lambda_{t,i}^2} - \frac{c^2\delta e^{-c\lambda_{t,i}}u(y_j,\lambda_{t,j})\left\{1 + \delta e^{-y_i}u(y_j,\lambda_{t,j})\right\}}{\varphi(y,\lambda_t,\delta)^2}\right] \\
& - f_\theta^{\alpha}(Y_t|\lambda_t)\left[\alpha\left\{\frac{Y_{t,i}}{\lambda_{t,i}} - 1 + \frac{c\delta e^{-c\lambda_{t,i}}u(Y_{t,j},\lambda_{t,j})}{\varphi(Y_t,\lambda_t,\delta)}\right\}^2 - \frac{Y_{t,i}}{\lambda_{t,i}^2} - \frac{c^2\delta e^{-c\lambda_{t,i}}u(Y_{t,j},\lambda_{t,j})\left\{1 + \delta e^{-Y_{t,i}}u(Y_{t,j},\lambda_{t,j})\right\}}{\varphi(Y_t,\lambda_t,\delta)^2}\right] \quad\text{for } (i,j)=(1,2),(2,1), \\
F_{t,33}(\theta) ={}& \alpha\sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t)\left\{\frac{u(y_1,\lambda_{t,1})u(y_2,\lambda_{t,2})}{\varphi(y,\lambda_t,\delta)}\right\}^2 - (\alpha-1)f_\theta^{\alpha}(Y_t|\lambda_t)\left\{\frac{u(Y_{t,1},\lambda_{t,1})u(Y_{t,2},\lambda_{t,2})}{\varphi(Y_t,\lambda_t,\delta)}\right\}^2, \\
F_{t,12}(\theta) ={}& \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t)\left[(1+\alpha)\left\{\frac{y_1}{\lambda_{t,1}} - 1 + \frac{c\delta e^{-c\lambda_{t,1}}u(y_2,\lambda_{t,2})}{\varphi(y,\lambda_t,\delta)}\right\}\left\{\frac{y_2}{\lambda_{t,2}} - 1 + \frac{c\delta e^{-c\lambda_{t,2}}u(y_1,\lambda_{t,1})}{\varphi(y,\lambda_t,\delta)}\right\} + \frac{c^2\delta e^{-c(\lambda_{t,1}+\lambda_{t,2})}}{\varphi(y,\lambda_t,\delta)^2}\right] \\
& - f_\theta^{\alpha}(Y_t|\lambda_t)\left[\alpha\left\{\frac{Y_{t,1}}{\lambda_{t,1}} - 1 + \frac{c\delta e^{-c\lambda_{t,1}}u(Y_{t,2},\lambda_{t,2})}{\varphi(Y_t,\lambda_t,\delta)}\right\}\left\{\frac{Y_{t,2}}{\lambda_{t,2}} - 1 + \frac{c\delta e^{-c\lambda_{t,2}}u(Y_{t,1},\lambda_{t,1})}{\varphi(Y_t,\lambda_t,\delta)}\right\} + \frac{c^2\delta e^{-c(\lambda_{t,1}+\lambda_{t,2})}}{\varphi(Y_t,\lambda_t,\delta)^2}\right], \\
F_{t,i3}(\theta) ={}& \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta^{1+\alpha}(y|\lambda_t)\left[(1+\alpha)\left\{\frac{y_i}{\lambda_{t,i}} - 1 + \frac{c\delta e^{-c\lambda_{t,i}}u(y_j,\lambda_{t,j})}{\varphi(y,\lambda_t,\delta)}\right\}\frac{u(y_1,\lambda_{t,1})u(y_2,\lambda_{t,2})}{\varphi(y,\lambda_t,\delta)} + \frac{c e^{-c\lambda_{t,i}}u(y_j,\lambda_{t,j})}{\varphi(y,\lambda_t,\delta)^2}\right] \\
& - f_\theta^{\alpha}(Y_t|\lambda_t)\left[\alpha\left\{\frac{Y_{t,i}}{\lambda_{t,i}} - 1 + \frac{c\delta e^{-c\lambda_{t,i}}u(Y_{t,j},\lambda_{t,j})}{\varphi(Y_t,\lambda_t,\delta)}\right\}\frac{u(Y_{t,1},\lambda_{t,1})u(Y_{t,2},\lambda_{t,2})}{\varphi(Y_t,\lambda_t,\delta)} + \frac{c e^{-c\lambda_{t,i}}u(Y_{t,j},\lambda_{t,j})}{\varphi(Y_t,\lambda_t,\delta)^2}\right] \quad\text{for } (i,j)=(1,2),(2,1).
\end{aligned}$$

The following four lemmas are helpful for proving Theorem 2.

**Lemma A4.** *Let $\tilde D_{t,i}(\theta)$ denote the counterpart of $D_{t,i}(\theta)$ obtained by substituting $\lambda_t$ with $\tilde\lambda_t$, for $i = 1,2,3$. Then, under (A1)–(A3), we have for $i = 1,2$,*

$$\begin{aligned}
|D_{t,i}(\theta)| &\le C(Y_{t,i}+1), \quad |\tilde D_{t,i}(\theta)| \le C(Y_{t,i}+1), \quad |D_{t,3}(\theta)| \le C, \quad |\tilde D_{t,3}(\theta)| \le C, \\
|F_{t,ii}(\theta)| &\le C(Y_{t,i}^2+Y_{t,i}+1), \quad |F_{t,33}(\theta)| \le C, \quad |F_{t,12}(\theta)| \le C(Y_{t,1}Y_{t,2}+Y_{t,1}+Y_{t,2}+1), \\
|F_{t,i3}(\theta)| &\le C(Y_{t,i}+1),
\end{aligned}$$

*and for* (*i*, *j*)=(1, 2),(2, 1)*,*

$$\begin{aligned}
|D_{t,i}(\theta) - \tilde D_{t,i}(\theta)| &\le C(Y_{t,i}^2+Y_{t,i}+1)|\lambda_{t,i}-\tilde\lambda_{t,i}| + C(Y_{t,1}Y_{t,2}+Y_{t,1}+Y_{t,2}+1)|\lambda_{t,j}-\tilde\lambda_{t,j}|, \\
|D_{t,3}(\theta) - \tilde D_{t,3}(\theta)| &\le C(Y_{t,1}+1)|\lambda_{t,1}-\tilde\lambda_{t,1}| + C(Y_{t,2}+1)|\lambda_{t,2}-\tilde\lambda_{t,2}|,
\end{aligned}$$

*where C is some positive constant.*

**Proof.** From **(A1)**–**(A3)** and the fact that $\lambda_{t,i}^{-1} \le \omega_L^{-1}$, we obtain

$$\begin{aligned}
|D_{t,i}(\theta)| &\le \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta(y|\lambda_t)\left\{\frac{y_i}{\lambda_{t,i}} + 1 + \frac{c|\delta|e^{-c\lambda_{t,i}}|u(y_j,\lambda_{t,j})|}{\varphi(y,\lambda_t,\delta)}\right\} + \left\{\frac{Y_{t,i}}{\lambda_{t,i}} + 1 + \frac{c|\delta|e^{-c\lambda_{t,i}}|u(Y_{t,j},\lambda_{t,j})|}{\varphi(Y_t,\lambda_t,\delta)}\right\} \\
&\le 1 + 1 + \frac{2c\delta_U}{\varphi_L} + \frac{Y_{t,i}}{\omega_L} + 1 + \frac{2c\delta_U}{\varphi_L} = \frac{Y_{t,i}}{\omega_L} + 3 + \frac{4c\delta_U}{\varphi_L}
\end{aligned}$$

for (*i*, *j*)=(1, 2),(2, 1) and

$$|D_{t,3}(\theta)| \le \sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta(y|\lambda_t)\frac{|u(y_1,\lambda_{t,1})||u(y_2,\lambda_{t,2})|}{\varphi(y,\lambda_t,\delta)} + \frac{|u(Y_{t,1},\lambda_{t,1})||u(Y_{t,2},\lambda_{t,2})|}{\varphi(Y_t,\lambda_t,\delta)} \le \frac{8}{\varphi_L}.$$

The second and fourth parts of the lemma hold in the same manner. Furthermore, analogous computations, using $|u(y,\lambda)| \le 2$ and $\varphi(y,\lambda,\delta) \ge \varphi_L$, show that $|F_{t,ii}(\theta)| \le C(Y_{t,i}^2+Y_{t,i}+1)$ and $|F_{t,12}(\theta)| \le C(Y_{t,1}Y_{t,2}+Y_{t,1}+Y_{t,2}+1)$ for $(i,j)=(1,2),(2,1)$,

$$|F_{t,33}(\theta)| \le \alpha\sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_\theta(y|\lambda_t)\left\{\frac{u(y_1,\lambda_{t,1})u(y_2,\lambda_{t,2})}{\varphi(y,\lambda_t,\delta)}\right\}^2 + (1+\alpha)\left\{\frac{u(Y_{t,1},\lambda_{t,1})u(Y_{t,2},\lambda_{t,2})}{\varphi(Y_t,\lambda_t,\delta)}\right\}^2 \le \frac{16\alpha}{\varphi_L^2} + \frac{16(1+\alpha)}{\varphi_L^2} = \frac{16(1+2\alpha)}{\varphi_L^2},$$

and $|F_{t,i3}(\theta)| \le C(Y_{t,i}+1)$ for $(i,j)=(1,2),(2,1)$.

Now, we prove the last two parts of the lemma. Because $F_{t,ij}(\theta) = \partial D_{t,i}(\theta)/\partial\lambda_{t,j}$ for $i = 1,2,3$ and $j = 1,2$, the MVT yields, for $i = 1,2,3$,

$$\begin{aligned}
|D_{t,i}(\theta) - \tilde D_{t,i}(\theta)| &\le \left|\frac{\partial D_{t,i}(\theta)}{\partial\lambda_{t,1}}\bigg|_{\lambda_t=\lambda_t^*}\right||\lambda_{t,1}-\tilde\lambda_{t,1}| + \left|\frac{\partial D_{t,i}(\theta)}{\partial\lambda_{t,2}}\bigg|_{\lambda_t=\lambda_t^*}\right||\lambda_{t,2}-\tilde\lambda_{t,2}| \\
&= \left|F_{t,i1}(\theta)\big|_{\lambda_t=\lambda_t^*}\right||\lambda_{t,1}-\tilde\lambda_{t,1}| + \left|F_{t,i2}(\theta)\big|_{\lambda_t=\lambda_t^*}\right||\lambda_{t,2}-\tilde\lambda_{t,2}|,
\end{aligned}$$

where $F_{t,ij}(\theta)\big|_{\lambda_t=\lambda_t^*}$ is the same as $F_{t,ij}(\theta)$ with $\lambda_t$ replaced by $\lambda_t^*$ for $j = 1,2$. Because $(\lambda_{t,i}^*)^{-1} \le \omega_L^{-1}$, it is easily shown that $F_{t,ij}(\theta)\big|_{\lambda_t=\lambda_t^*}$ admits the same upper bound as $F_{t,ij}(\theta)$ by following the aforementioned arguments. Therefore, the lemma is established.

**Lemma A5.** *Under (A1)–(A3), we have*

$$E\left(\sup_{\theta\in\Theta}\left\|\frac{\partial^2 h_{\alpha,t}(\theta)}{\partial\theta\partial\theta^T}\right\|_1\right) < \infty \quad\text{and}\quad E\left(\sup_{\theta\in\Theta}\left\|\frac{\partial h_{\alpha,t}(\theta)}{\partial\theta}\frac{\partial h_{\alpha,t}(\theta)}{\partial\theta^T}\right\|_1\right) < \infty.$$

**Proof.** We can write

$$E\left(\sup_{\theta\in\Theta}\left\|\frac{\partial^2 h_{\alpha,t}(\theta)}{\partial\theta\partial\theta^T}\right\|_1\right) \le (1+\alpha)\left\{E\left(\sup_{\theta\in\Theta}\|F_t(\theta)\|_1\right) + E\left(\sup_{\theta\in\Theta}\left\|D_t(\theta)\frac{\partial\Lambda_t(\theta)}{\partial\theta^T}\right\|_1\right)\right\}.$$

Hence, to show the first part of the lemma, it is sufficient to show that, for *i*, *j* = 1, 2,

$$\begin{aligned}
&E\left(\sup_{\theta\in\Theta}\left\|F_{t,ij}(\theta)\frac{\partial\lambda_{t,i}}{\partial\theta_i}\frac{\partial\lambda_{t,j}}{\partial\theta_j^T}\right\|_1\right) < \infty, \quad E\left(\sup_{\theta\in\Theta}\left\|F_{t,i3}(\theta)\frac{\partial\lambda_{t,i}}{\partial\theta_i}\right\|_1\right) < \infty, \\
&E\left(\sup_{\theta\in\Theta}|F_{t,33}(\theta)|\right) < \infty, \quad\text{and}\quad E\left(\sup_{\theta\in\Theta}\left\|D_{t,i}(\theta)\frac{\partial^2\lambda_{t,i}}{\partial\theta_i\partial\theta_i^T}\right\|_1\right) < \infty,
\end{aligned}$$

which can be directly obtained from $(iii)$ of Lemma A1, Lemma A4, and the Cauchy–Schwarz inequality. For example,

$$\begin{split} &E\left(\sup\_{\theta\in\Theta} \left\|F\_{t,12}(\theta)\frac{\partial\lambda\_{t,1}}{\partial\theta\_{1}}\frac{\partial\lambda\_{t,2}}{\partial\theta\_{2}^{T}}\right\|\_{1}\right) \\ &\leq \left\{E\left(\sup\_{\theta\in\Theta}|F\_{t,12}(\theta)|\right)^{2}\right\}^{1/2}\left\{E\left(\sup\_{\theta\in\Theta} \left\|\frac{\partial\lambda\_{t,1}}{\partial\theta\_{1}}\frac{\partial\lambda\_{t,2}}{\partial\theta\_{2}^{T}}\right\|\_{1}\right)^{2}\right\}^{1/2} \\ &\leq \left[E\left\{\mathcal{C}(Y\_{t,1}Y\_{t,2}+Y\_{t,1}+Y\_{t,2}+1)\right\}^{2}\right]^{1/2} \\ &\quad\times \left\{E\left(\sup\_{\theta\_{1}\in\Theta\_{1}}\left\|\frac{\partial\lambda\_{t,1}}{\partial\theta\_{1}}\right\|\_{1}\right)^{4}\right\}^{1/4}\left\{E\left(\sup\_{\theta\_{2}\in\Theta\_{2}}\left\|\frac{\partial\lambda\_{t,2}}{\partial\theta\_{2}}\right\|\_{1}\right)^{4}\right\}^{1/4} \\ &<\infty \end{split}$$

and

$$\begin{split} &E\left(\sup\_{\theta\in\Theta} \left\|D\_{t,1}(\theta)\frac{\partial^2 \lambda\_{t,1}}{\partial\theta\_1\partial\theta\_1^T}\right\|\_{1}\right) \\ &\leq \quad \left\{E\left(\sup\_{\theta\in\Theta}|D\_{t,1}(\theta)|\right)^2\right\}^{1/2}\left\{E\left(\sup\_{\theta\_1\in\Theta\_1} \left\|\frac{\partial^2 \lambda\_{t,1}}{\partial\theta\_1\partial\theta\_1^T}\right\|\_{1}\right)^2\right\}^{1/2} \\ &\leq \quad \left[E\{C(Y\_{t,1}+1)\}^2\right]^{1/2}\left\{E\left(\sup\_{\theta\_1\in\Theta\_1} \left\|\frac{\partial^2 \lambda\_{t,1}}{\partial\theta\_1\partial\theta\_1^T}\right\|\_{1}\right)^2\right\}^{1/2} \\ &< \quad \infty. \end{split}$$

The second part of the lemma can be shown in the same manner.

**Lemma A6.** *Under (A1)–(A3), we have*

$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\sup_{\theta\in\Theta}\left\|\frac{\partial h_{\alpha,t}(\theta)}{\partial\theta} - \frac{\partial\tilde h_{\alpha,t}(\theta)}{\partial\theta}\right\|_1 \stackrel{a.s.}{\longrightarrow} 0 \text{ as } n \to \infty.$$

**Proof.** Owing to (*iv*) and (*vi*) of Lemma A1 and Lemma A4, we obtain a.s.,

$$\begin{aligned}
\frac{1}{1+\alpha}\sup_{\theta\in\Theta}\left\|\frac{\partial h_{\alpha,t}(\theta)}{\partial\theta} - \frac{\partial\tilde h_{\alpha,t}(\theta)}{\partial\theta}\right\|_1 &\le \sup_{\theta\in\Theta}\left\|\tilde D_t(\theta)\right\|_1 \sup_{\theta\in\Theta}\left\|\Lambda_t(\theta) - \tilde\Lambda_t(\theta)\right\|_1 + \sup_{\theta\in\Theta}\left\|\Lambda_t(\theta)\right\|_1 \sup_{\theta\in\Theta}\left\|D_t(\theta) - \tilde D_t(\theta)\right\|_1 \\
&\le \left(\sum_{i=1}^{3}\sup_{\theta\in\Theta}|\tilde D_{t,i}(\theta)|\right)\left(\sum_{i=1}^{2}\sup_{\theta_i\in\Theta_i}\left\|\frac{\partial\lambda_{t,i}}{\partial\theta_i} - \frac{\partial\tilde\lambda_{t,i}}{\partial\theta_i}\right\|_1\right) \\
&\quad + \left(\sum_{i=1}^{2}\sup_{\theta_i\in\Theta_i}\left\|\frac{\partial\lambda_{t,i}}{\partial\theta_i}\right\|_1 + 1\right)\left(\sum_{i=1}^{3}\sup_{\theta\in\Theta}|D_{t,i}(\theta) - \tilde D_{t,i}(\theta)|\right) \\
&\le 2C(Y_{t,1}+Y_{t,2}+3)V\rho^t \\
&\quad + \left(\sum_{i=1}^{2}\sup_{\theta_i\in\Theta_i}\left\|\frac{\partial\lambda_{t,i}}{\partial\theta_i}\right\|_1 + 1\right)C\left\{Y_{t,1}^2+Y_{t,2}^2+2Y_{t,1}Y_{t,2}+4(Y_{t,1}+Y_{t,2})+6\right\}V\rho^t,
\end{aligned}$$

where $\tilde D_t(\theta)$ and $\tilde\Lambda_t(\theta)$ are the same as $D_t(\theta)$ and $\Lambda_t(\theta)$ with $\lambda_t$ replaced by $\tilde\lambda_t$. Therefore, by Lemma 2.1 in Straumann and Mikosch [45], together with $(iii)$ of Lemma A1, the RHS of the last inequality converges to 0 exponentially fast a.s., and thus the lemma is validated. We refer the reader to Straumann and Mikosch [45] and Cui and Zheng [46] for more details on exponentially fast a.s. convergence.


**Lemma A7.** *Let $\hat\theta_{\alpha,n}^H = \operatorname{argmin}_{\theta\in\Theta} H_{\alpha,n}(\theta)$. Then, under (A1)–(A3), we have*

$$\hat\theta_{\alpha,n}^H \stackrel{a.s.}{\longrightarrow} \theta^0 \quad\text{and}\quad \sqrt{n}(\hat\theta_{\alpha,n}^H - \theta^0) \stackrel{d}{\longrightarrow} N(0, J_\alpha^{-1}K_\alpha J_\alpha^{-1}) \text{ as } n \to \infty.$$

**Proof.** As seen in the proof of Theorem 1, $\sup_{\theta\in\Theta}\left|n^{-1}\sum_{t=1}^{n} h_{\alpha,t}(\theta) - E(h_{\alpha,t}(\theta))\right|$ converges to 0 a.s. and $E(h_{\alpha,t}(\theta))$ has a unique minimum at $\theta^0$. Hence, the first part of the lemma is verified.

Next, we handle the second part. Let *θ*(*i*), *i* = 1, ... , 9 be the *i*-th element of *θ*. Using MVT, we have

$$0 = \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{\partial h_{\alpha,t}(\theta^0)}{\partial\theta(i)} + \sqrt{n}(\hat\theta_{\alpha,n}^H - \theta^0)^T\left(\frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2 h_{\alpha,t}(\theta_{\alpha,n,i}^*)}{\partial\theta\partial\theta(i)}\right)$$

for some vector $\theta_{\alpha,n,i}^*$ between $\theta^0$ and $\hat\theta_{\alpha,n}^H$, so that, eventually, we can write

$$0 = \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{\partial h_{\alpha,t}(\theta^0)}{\partial\theta} + \sqrt{n}(\hat\theta_{\alpha,n}^H - \theta^0)^T\left(\frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2 h_{\alpha,t}(\theta_{\alpha,n}^*)}{\partial\theta\partial\theta^T}\right),$$

where the term $\partial^2 h_{\alpha,t}(\theta_{\alpha,n}^*)/\partial\theta\partial\theta^T$ actually represents a $9\times9$ matrix whose $(i,j)$-th entry is $\partial^2 h_{\alpha,t}(\theta_{\alpha,n,ij}^*)/\partial\theta(i)\partial\theta(j)$ for some vector $\theta_{\alpha,n,ij}^*$ between $\theta^0$ and $\hat\theta_{\alpha,n}^H$. We first show that

$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{\partial h_{\alpha,t}(\theta^0)}{\partial\theta} \stackrel{d}{\longrightarrow} N(0, K_\alpha).\tag{A1}$$

For $\nu = (\nu_1^T, \nu_2^T, \nu_3)^T \in \mathbb{R}^4\times\mathbb{R}^4\times\mathbb{R}$, we obtain

$$\begin{aligned}
E\left(\nu^T\frac{\partial h_{\alpha,t}(\theta^0)}{\partial\theta}\,\bigg|\,\mathcal{F}_{t-1}\right) &= (1+\alpha)\left\{\nu_1^T\frac{\partial\lambda_{t,1}^0}{\partial\theta_1}E\left(D_{t,1}(\theta^0)|\mathcal{F}_{t-1}\right) + \nu_2^T\frac{\partial\lambda_{t,2}^0}{\partial\theta_2}E\left(D_{t,2}(\theta^0)|\mathcal{F}_{t-1}\right) + \nu_3 E\left(D_{t,3}(\theta^0)|\mathcal{F}_{t-1}\right)\right\} \\
&= 0
\end{aligned}$$

and

$$E\left(\nu^T\frac{\partial h_{\alpha,t}(\theta^0)}{\partial\theta}\right)^2 = \nu^T E\left(\frac{\partial h_{\alpha,t}(\theta^0)}{\partial\theta}\frac{\partial h_{\alpha,t}(\theta^0)}{\partial\theta^T}\right)\nu < \infty$$

by Lemma A5. Hence, it follows from the central limit theorem in Billingsley [47] that

$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nu^T\frac{\partial h_{\alpha,t}(\theta^0)}{\partial\theta} \stackrel{d}{\longrightarrow} N(0, \nu^T K_\alpha\nu),$$

which implies (A1).
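The Cramér–Wold step above reduces (A1) to a scalar CLT for stationary ergodic martingale differences. A toy simulation of this stabilization, using i.i.d. mean-zero terms as the simplest martingale-difference sequence (the variance $K = 2$, sample size, and replication count are illustrative choices, not quantities from the paper):

```python
# Toy illustration of the CLT step: for mean-zero terms with variance K,
# the scaled sums n^{-1/2} * sum_t have mean ~0 and variance ~K.
# (i.i.d. terms serve as the simplest martingale-difference example.)
import random

rng = random.Random(42)
K = 2.0                      # stand-in for nu^T K_alpha nu
n, reps = 2000, 400
scaled_sums = []
for _ in range(reps):
    s = sum(rng.gauss(0.0, K ** 0.5) for _ in range(n))
    scaled_sums.append(s / n ** 0.5)

mean = sum(scaled_sums) / reps
var = sum((x - mean) ** 2 for x in scaled_sums) / reps
```

Across replications, `mean` is close to 0 and `var` close to $K$, mirroring the limit $N(0, \nu^T K_\alpha \nu)$.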

Now, we claim that

$$-\frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2 h_{\alpha,t}(\theta_{\alpha,n,ij}^*)}{\partial\theta(i)\partial\theta(j)} \stackrel{a.s.}{\longrightarrow} J_\alpha^{ij},\tag{A2}$$

where $J_\alpha^{ij}$ denotes the $(i,j)$-th entry of $J_\alpha$. From Lemma A5, $J_\alpha$ is finite. Further, after some algebra, we have

$$\begin{aligned}
\nu^T(-J_\alpha)\nu &= (1+\alpha)E\left[\sum_{y_1=0}^{\infty}\sum_{y_2=0}^{\infty} f_{\theta^0}^{1+\alpha}(y|\lambda_t^0)\left\{\left(\nu_1^T\frac{\partial\lambda_{t,1}^0}{\partial\theta_1}\right)\left(\frac{y_1}{\lambda_{t,1}^0} - 1 + \frac{c\delta^0 e^{-c\lambda_{t,1}^0}u(y_2,\lambda_{t,2}^0)}{\varphi(y,\lambda_t^0,\delta^0)}\right)\right.\right. \\
&\qquad + \left(\nu_2^T\frac{\partial\lambda_{t,2}^0}{\partial\theta_2}\right)\left(\frac{y_2}{\lambda_{t,2}^0} - 1 + \frac{c\delta^0 e^{-c\lambda_{t,2}^0}u(y_1,\lambda_{t,1}^0)}{\varphi(y,\lambda_t^0,\delta^0)}\right) \\
&\qquad \left.\left. + \nu_3\frac{u(y_1,\lambda_{t,1}^0)u(y_2,\lambda_{t,2}^0)}{\varphi(y,\lambda_t^0,\delta^0)}\right\}^2\right] > 0
\end{aligned}$$

by (*v*) of Lemma A1, which implies that $J\_{\alpha}$ is non-singular. Note that we can write

$$\begin{split} & \left| \frac{1}{n} \sum\_{t=1}^{n} \frac{\partial^2 h\_{\alpha,t}(\theta\_{\alpha,n,ij}^\*)}{\partial \theta(i) \partial \theta(j)} - E \left( \frac{\partial^2 h\_{\alpha,t}(\theta^0)}{\partial \theta(i) \partial \theta(j)} \right) \right| \\ & \leq \sup\_{\theta \in \Theta} \left| \frac{1}{n} \sum\_{t=1}^{n} \frac{\partial^2 h\_{\alpha,t}(\theta)}{\partial \theta(i) \partial \theta(j)} - E \left( \frac{\partial^2 h\_{\alpha,t}(\theta)}{\partial \theta(i) \partial \theta(j)} \right) \right| + \left| E \left( \frac{\partial^2 h\_{\alpha,t}(\theta\_{\alpha,n,ij}^\*)}{\partial \theta(i) \partial \theta(j)} \right) - E \left( \frac{\partial^2 h\_{\alpha,t}(\theta^0)}{\partial \theta(i) \partial \theta(j)} \right) \right|. \end{split}$$

Because $\partial^2 h\_{\alpha,t}(\theta)/\partial\theta(i)\partial\theta(j)$ is stationary and ergodic, from Lemma A5, the first term on the RHS of the inequality converges to 0 a.s. Moreover, the second term converges to 0 by the dominated convergence theorem. Hence, (A2) is asserted. Therefore, from (A1) and (A2), the second part of the lemma is established.

**Proof of Theorem 2.** From MVT, we get

$$\frac{1}{n}\sum\_{t=1}^{n}\frac{\partial h\_{\alpha,t}(\widehat{\theta}\_{\alpha,n}^{H})}{\partial\theta(i)} - \frac{1}{n}\sum\_{t=1}^{n}\frac{\partial h\_{\alpha,t}(\widehat{\theta}\_{\alpha,n})}{\partial\theta(i)} = (\widehat{\theta}\_{\alpha,n}^{H} - \widehat{\theta}\_{\alpha,n})^{T} \left(\frac{1}{n}\sum\_{t=1}^{n}\frac{\partial^{2}h\_{\alpha,t}(\zeta\_{\alpha,n,i})}{\partial\theta\,\partial\theta(i)}\right)$$

for some vector $\zeta\_{\alpha,n,i}$ between $\widehat{\theta}\_{\alpha,n}^{H}$ and $\widehat{\theta}\_{\alpha,n}$. Thus, we can write

$$\frac{1}{n}\sum\_{t=1}^{n}\frac{\partial h\_{\alpha,t}(\widehat{\theta}\_{\alpha,n}^{H})}{\partial\theta} - \frac{1}{n}\sum\_{t=1}^{n}\frac{\partial h\_{\alpha,t}(\widehat{\theta}\_{\alpha,n})}{\partial\theta} = (\widehat{\theta}\_{\alpha,n}^{H} - \widehat{\theta}\_{\alpha,n})^{T}\left(\frac{1}{n}\sum\_{t=1}^{n}\frac{\partial^{2}h\_{\alpha,t}(\zeta\_{\alpha,n})}{\partial\theta\,\partial\theta^{T}}\right),$$

where the (*i*, *j*)-th entry of $\partial^2 h\_{\alpha,t}(\zeta\_{\alpha,n})/\partial\theta\partial\theta^{T}$ is $\partial^2 h\_{\alpha,t}(\zeta\_{\alpha,n,ij})/\partial\theta(i)\partial\theta(j)$ for some vector $\zeta\_{\alpha,n,ij}$ between $\widehat{\theta}\_{\alpha,n}^{H}$ and $\widehat{\theta}\_{\alpha,n}$. Since $n^{-1}\sum\_{t=1}^{n}\partial h\_{\alpha,t}(\widehat{\theta}\_{\alpha,n}^{H})/\partial\theta = 0$ and $n^{-1}\sum\_{t=1}^{n}\partial \tilde{h}\_{\alpha,t}(\widehat{\theta}\_{\alpha,n})/\partial\theta = 0$, we have

$$\frac{1}{\sqrt{n}}\sum\_{t=1}^{n}\frac{\partial \tilde{h}\_{\alpha,t}(\widehat{\theta}\_{\alpha,n})}{\partial \theta} - \frac{1}{\sqrt{n}}\sum\_{t=1}^{n}\frac{\partial h\_{\alpha,t}(\widehat{\theta}\_{\alpha,n})}{\partial \theta} = \sqrt{n}(\widehat{\theta}\_{\alpha,n}^{H} - \widehat{\theta}\_{\alpha,n})^{T}\left(\frac{1}{n}\sum\_{t=1}^{n}\frac{\partial^{2}h\_{\alpha,t}(\zeta\_{\alpha,n})}{\partial \theta\, \partial \theta^{T}}\right).$$

The LHS of the above equation converges to 0 a.s. by Lemma A6, and $n^{-1}\sum\_{t=1}^{n}\partial^2 h\_{\alpha,t}(\zeta\_{\alpha,n})/\partial\theta\partial\theta^{T}$ converges to $-J\_{\alpha}$ a.s. in a similar way as in the proof of Lemma A7. Therefore, the theorem is established due to Lemma A7.



### *Article* **Multivariate Count Data Models for Time Series Forecasting**

**Yuliya Shapovalova 1,\*, Nalan Baştürk <sup>2</sup> and Michael Eichler <sup>2</sup>**


**\*** Correspondence: yuliya.shapovalova@ru.nl

**Abstract:** Count data appears in many research fields and exhibits certain features that make modeling difficult. Most popular approaches to modeling count data can be classified into observation and parameter-driven models. In this paper, we review two models from these classes: the loglinear multivariate conditional intensity model (also referred to as an integer-valued generalized autoregressive conditional heteroskedastic model) and the non-linear state-space model for count data. We compare these models in terms of forecasting performance on simulated data and two real datasets. In simulations, we consider the case of model misspecification. We find that both models have advantages in different situations, and we discuss the pros and cons of inference for both models in detail.

**Keywords:** multivariate count data; INGARCH; state-space model; bank failures; transactions

**Citation:** Shapovalova, Y.; Baştürk, N.; Eichler, M. Multivariate Count Data Models for Time Series Forecasting. *Entropy* **2021**, *23*, 718. https://doi.org/10.3390/e23060718

Academic Editor: Christian H. Weiss

Received: 30 April 2021 Accepted: 2 June 2021 Published: 5 June 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Modeling time series of counts is relevant in a range of application areas, including the dynamics of the number of infectious diseases, the number of road accidents, or the number of bank failures. In many applications, such count data dynamics are correlated across several data series. Examples range from correlated numbers of bank failures [1] and numbers of crimes [2] to COVID-19 contagion dynamics [3]. The analysis of such correlations provides detailed information about the overall connectedness of the series, as well as the dynamics of an individual series conditional on the others. Several multivariate count data models have been proposed to capture the overall connectedness of multivariate count data. Each of these models has different underlying assumptions as well as computational challenges. We present a comparative study of two families of multivariate count data models, namely State Space Models (SSM) and log-linear multivariate autoregressive conditional intensity (MACI) models, based on simulation studies and two empirical applications.

We provide some examples of count data and discuss particular properties that one desires to model when dealing with such data. In this paper, we assume that the counts are unbounded and that both models are stationary. For a discussion of the difference between bounded and unbounded count data and of the corresponding modeling approaches, we refer to [4]. The top panels in Figure 1 present two conventional data sets that have been used for univariate illustrations, namely the monthly number of cases of poliomyelitis in the U.S. between 1970 and 1983, and asthma presentations at a Sydney hospital. The middle panels in Figure 1 present the number of bank failures in the U.S. over time, a dataset that we also analyze in this paper, and the number of transactions for BMW in a 30-second interval. The bottom panels in Figure 1 present the number of car crashes and the number of earthquakes. The former, the number of car crashes over time, is analyzed in Park and Lord [5] with a multivariate Poisson log-normal model with correlations for modeling crash frequency by severity. The authors demonstrate that accounting for the correlations in the multivariate model can improve the accuracy of the estimation. A common feature in all presented datasets is the autocorrelation present in count data over time, which is visible in the time series plots. In multivariate count time series data, this correlation generalizes to a correlation between past and current values of a specific series as well as between different series.

**Figure 1.** Typical examples of count data coming from applications in different scientific fields. (**a**) Monthly number of cases of poliomyelitis in the U.S. (1970–1983). (**b**) Asthma presentations at a Sydney hospital. (**c**) Number of bank failures in the U.S. (**d**) Number of transactions for BMW on a 30 s interval. (**e**) Number of car crashes. (**f**) Number of earthquakes.

Models for multivariate count time series typically rely on multivariate Poisson distributions, where time-variation is defined through one or more rate parameters [6]. In some cases, Gaussian approximations are used, but, as has been shown in [7], this can lead to reduced performance in risk forecasting assessment. In general, the quality of such approximations depends on the particular problem [8]. Estimation of these models is computationally demanding for high numbers of counts, as the estimation relies on a sum over all counts. In addition, these models typically have positivity restrictions on the conditional intensity function that governs the Poisson process over time and on the correlation between different time series. A few exceptions to the positive correlation assumption exist; see, for example, [9,10].

An alternative model for the joint distribution of count data is the copula model. A number of papers have proposed different copula models for multivariate count time series; see, for example, [11–14]. Copulas are generally used for modeling dependency in multiple series, which makes them attractive also for multiple count time series. However, several issues arise in their application to count data, such as unidentifiability and the failure to account for potential overdispersion, a property that is common for count data. Genest and Nešlehová [15] provide a detailed overview of copula models for count data, and propose and compare Bayesian and classical inferential approaches for multivariate Poisson regression. They show that the Bayesian and classical approaches are of a similar computational order.

Both approaches to modeling the joint distribution of count data, the multivariate Poisson distribution and copulas, can be incorporated in the autoregressive conditional intensity (ACI) framework, often also referred to as the integer-valued generalized autoregressive conditional heteroskedasticity (INGARCH) model. This model belongs to the class of observation driven models, as opposed to parameter driven models, a classification proposed by Cox et al. [16]. ACI models have dominated the literature for quite a long time despite their restrictiveness: these models allow only for positive coefficients in the equation for the conditional intensity. These bounded coefficients lead to several problems besides a potentially unrealistic dependence structure for some data. In particular, the problem of calculating confidence sets for parameters that are close to or on the boundary arises and has not yet been solved in the literature. Another observation driven model that has been proposed as an alternative to the ACI framework is the log-linear model, see Fokianos and Tjøstheim [17], a multivariate extension of which has been considered in Doukhan et al. [10]. Even though the problem of modeling the joint distribution remains, the advantage of this approach is that no restrictions on the parameter space are required due to the log-transform of the data.

Another class of models that can be considered for modeling count data, but is rarely used in the literature, is parameter driven models and, in particular, non-linear state-space models. In this framework, the observations are driven by an independent unobserved stochastic process which, for instance, can be a (vector) autoregressive process (*VAR*(*p*)). These models have been discussed extensively in the univariate case; see, for example, Davis et al. [18]. However, they are rarely used in multivariate applications due to the computationally demanding estimation methods that have to be used. To our knowledge, only one very recent study, Zhang et al. [19], has considered them in a multivariate application. Non-linear state-space models are capable of modeling and inferring complex dependence structures in the data. They allow for both negative and positive contemporaneous correlation, as well as for both negative and positive Granger-causal feedback. Thereby, these models avoid the problem of modeling the joint distribution of time series of counts and provide a coherent inferential tool in the Bayesian framework. This is what distinguishes our approach from that of Zhang et al. [19], who consider frequentist estimation of these models. We also compare SSM to log-linear models instead of MACI models, since the former allow for negative dependence between the intensities and hence appear to be more natural competitors of SSM models.

In this paper, we compare two classes of models, observation driven and parameter driven models, in terms of their forecasting performances. We estimate the observation driven models with the quasi-maximum likelihood method. Parameter driven models, however, fit very well into the Bayesian paradigm, which is what we use for their estimation. Certain advantages come with this framework, such as the uncertainty about the model parameters and about forecasts of the multivariate time series that is naturally obtained from the posterior distribution [20]. In particular, we use particle Markov chain Monte Carlo (pMCMC) [21] for the estimation of the parameter driven model. As is discussed in [22], pMCMC outperforms other methods (variational Bayes [23], integrated nested Laplace approximation [24] and Riemann manifold Hamiltonian Monte Carlo [25]) in terms of parameter estimation. There are other recent methods for the estimation of state-space models, such as auxiliary likelihood-based approximate Bayesian computation [26] and variational Sequential Monte Carlo [27], but their performance has to be investigated further, which is outside the scope of this paper.

We present a set of simulation studies to show how these models perform when they are correctly specified and misspecified. The simulation results show that, as expected, the correctly specified models generally perform well, but there are exceptions. In particular, parameter driven models have better forecast performance in some simulations even when they are misspecified. In addition to these simulation studies, we compare the performances of these models in two real data applications. The two data sets we analyze exhibit different sample sizes, standard deviations, dispersion and maximum counts. We show that the overall forecast performances of the models can differ substantially, depending on the application. Furthermore, for the second data set, we find that observation driven models capture extreme data values better than parameter driven models.

The remainder of this paper is organized as follows: Sections 2 and 3 summarize observation and parameter driven models, respectively. Section 4 presents the model and forecast comparison tools we use for multivariate count data models. Section 5 presents simulation results. Section 6 presents results from applications to two data sets. Finally, Section 7 concludes the paper.

#### **2. Observation Driven Models**

In this section, we summarize two observation driven models: the multivariate autoregressive conditional intensity model and its log-linear analog. Both of these models are characterized by dynamics that depend on the past of the process itself and some noise. Both models have been considered in Doukhan et al. [10], where the authors discussed some theoretical properties and proposed a copula approach for modeling the joint count distribution. Copulas are flexible tools for modeling dependence structure, but their use in count time series models brings challenges. We first summarize the use of the Poisson distribution for count data, analyze both models under an independence assumption on the Poisson random variables, and, at the end of this section, discuss the extension to modeling multiple count time series with the multivariate Poisson distribution.

#### *2.1. Poisson Distribution*

Many count time series models take their origins in the idea of the Poisson regression model; an extensive overview of these models is given in Fokianos [28]. Specifically, both models considered in this section, as well as the parameter driven models in Section 3, rely on Poisson distributions. We therefore first provide some background on the Poisson distribution, which has played an important role in modeling count time series data, as it describes the number of independent events that occur in a time period. The Poisson distribution is defined for a random variable *x* that takes integer values in {0, 1, ... }. The mean of the distribution, *λ*, describes the average number of occurrences per interval; the distribution has the equi-dispersion property, since its variance is also *λ*; and the probability mass function (pmf) of the distribution is

$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \ x = 0, 1, 2, 3, \dots, \tag{1}$$

with E(*x*) = *Var*(*x*) = *λ*.
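As a quick numerical sanity check of the equi-dispersion property, the following plain-Python sketch (the function name is ours, for illustration only) evaluates the pmf in Equation (1) via the recurrence p(x+1) = p(x)·λ/(x+1) and recovers E(x) = Var(x) = λ by truncated summation:

```python
from math import exp

def poisson_moments(lam: float, upper: int = 60):
    """Approximate E(x) and Var(x) of a Poisson(lam) variable by summing
    the pmf of Equation (1) up to `upper`; the truncated tail is negligible
    for moderate lam."""
    p = exp(-lam)               # p(0) = e^{-lam}
    mean = second = 0.0
    for x in range(upper):
        mean += x * p
        second += x * x * p
        p *= lam / (x + 1)      # recurrence: p(x+1) = p(x) * lam / (x+1)
    return mean, second - mean ** 2

mean, var = poisson_moments(3.5)
# equi-dispersion: both moments are numerically equal to lam = 3.5
```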

For the simplest multivariate case, two independent Poisson random variables, say $x\_1$ and $x\_2$, the joint pmf reads

$$p(x\_1, x\_2) = \prod\_{i=1}^2 \frac{e^{-\lambda\_i}\lambda\_i^{x\_i}}{x\_i!}. \tag{2}$$

The multivariate extension in (2) is rather naive due to the underlying independence assumption between $x\_1$ and $x\_2$. Such a model ignores potential dependency in real-world data and is therefore unsuitable for the majority of applications. One way to use the Poisson distribution for modeling count data in the multivariate case while incorporating a correlation structure is through the so-called *trivariate reduction* [13,29]. The idea is that the correlation can be modeled through a third Poisson variable. Assume we have three independent random variables $X\_1 \sim \text{Poisson}(\lambda\_{t,1} - \phi)$, $X\_2 \sim \text{Poisson}(\phi)$ and $X\_3 \sim \text{Poisson}(\lambda\_{t,2} - \phi)$, where $0 \leq \phi \leq \min\{\lambda\_{t,1}, \lambda\_{t,2}\}$. Define $Y\_{t,1} = X\_1 + X\_2$ and $Y\_{t,2} = X\_2 + X\_3$. In this way, the shared random variable $X\_2$ is exploited to model the dependence between $Y\_{t,1}$ and $Y\_{t,2}$, with $\mathrm{Cov}(Y\_{t,1}, Y\_{t,2}) = \phi$. The restriction of this approach is that the correlation is the same between all series (in case one wants to model systems beyond the bivariate case) and the dependence can only be positive. We further discuss the trivariate reduction technique in the context of ACI/INGARCH models. In particular, the difficulty of extending it to higher dimensions presents a challenging task.
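A minimal simulation sketch of the trivariate reduction (plain Python; the sampler and function names are ours, not from the paper): the shared component drawn from Poisson(ϕ) induces covariance ϕ between the two observed series.

```python
import random
from math import exp

def rpois(lam: float, rng: random.Random) -> int:
    """Knuth's classical Poisson sampler; adequate for small lam."""
    L, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def trivariate_reduction(lam1, lam2, phi, n, seed=0):
    """Draw n pairs (Y1, Y2) with Poisson marginals lam1, lam2 and Cov = phi."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x1 = rpois(lam1 - phi, rng)   # idiosyncratic part of Y1
        x2 = rpois(phi, rng)          # shared component -> covariance phi
        x3 = rpois(lam2 - phi, rng)   # idiosyncratic part of Y2
        pairs.append((x1 + x2, x2 + x3))
    return pairs

pairs = trivariate_reduction(4.0, 3.0, 1.5, 20000)
# empirical means approach 4.0 and 3.0, empirical covariance approaches 1.5
```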

#### *2.2. MACI (INGARCH)*

The Poisson integer-valued generalized autoregressive conditional heteroskedastic (INGARCH) processes [30], also called multivariate autoregressive conditional intensity (MACI) models in the literature, are built upon the GARCH framework and are capable of capturing the time series properties of count data. As for GARCH-type models, it is assumed that the conditional mean of the process at time *t* depends on the value of the process at time *t* − 1 and on its conditional mean at time *t* − 1. The time series of counts follows a Poisson distribution with conditional mean $\lambda\_{i,t}$, that is,

$$X\_{i,t} \mid \mathcal{F}\_{t-1} \sim \text{Poisson}(\lambda\_{i,t}), \ t = 1, \ldots, n. \tag{3}$$

The corresponding joint pmf reads

$$P(\mathbf{X}\_{1t} = \mathbf{x}\_{1t}, \dots, \mathbf{X}\_{nt} = \mathbf{x}\_{nt} \mid \mathcal{F}\_{t-1}) = \prod\_{i=1}^{n} \frac{e^{-\lambda\_{it}} \lambda\_{it}^{\mathbf{x}\_{it}}}{\mathbf{x}\_{it}!}. \tag{4}$$

The dynamics of the conditional intensity $\lambda\_t = E[X\_t \mid \mathcal{F}\_{t-1}]$ follow

$$
\lambda\_t = \omega + \sum\_{i=1}^p A^i \lambda\_{t-i} + \sum\_{j=1}^q B^j X\_{t-j}.\tag{5}
$$

Note that the elements of $\omega$, $A^i$, $B^j$ are assumed to be positive to ensure the positivity of the intensity process $\lambda\_t$. (Doukhan et al. [10] argue that the condition $\| A + B \|\_2 < 1$ guarantees stationarity.) In addition, we assume no contemporaneous correlation in the counts. Consider the bivariate case for the conditional intensity process

$$
\begin{bmatrix} \lambda\_{1t} \\ \lambda\_{2t} \end{bmatrix} = \begin{bmatrix} \omega\_1 \\ \omega\_2 \end{bmatrix} + \begin{bmatrix} a\_{11} & a\_{12} \\ a\_{21} & a\_{22} \end{bmatrix} \begin{bmatrix} \lambda\_{1t-1} \\ \lambda\_{2t-1} \end{bmatrix} + \begin{bmatrix} b\_{11} & b\_{12} \\ b\_{21} & b\_{22} \end{bmatrix} \begin{bmatrix} X\_{1t-1} \\ X\_{2t-1} \end{bmatrix}, \quad t = 0, \pm 1, \pm 2, \dots \tag{6}
$$

From Equation (6), it is clear that when *A* and *B* are diagonal, there is no dependence structure between the intensities. Further, when $a\_{12} = 0$ and $b\_{12} = 0$, the intensity of the first process, $\lambda\_{1,t}$, depends only on its own past, while the second process can still depend on the dynamics of the first one. Finally, if we restrict *A* to be diagonal and *B* to be non-diagonal, every intensity process depends on its own past and possibly on the past of all of the observations. This constraint is relevant when one wants to apply graphical modeling to this problem.
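The recursion in Equation (6) is straightforward to simulate. Below is a sketch (plain Python; the parameter values are hypothetical, and the Poisson sampler is Knuth's classical algorithm) of a bivariate MACI/INGARCH(1,1) path; with non-negative coefficients, the intensities remain strictly positive by construction.

```python
import random
from math import exp

def rpois(lam: float, rng: random.Random) -> int:
    """Knuth's Poisson sampler (adequate for moderate intensities)."""
    L, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_maci(omega, A, B, T, seed=1):
    """Simulate the bivariate MACI/INGARCH(1,1) recursion of Equation (6).
    omega: length-2 list; A, B: 2x2 lists with non-negative entries."""
    rng = random.Random(seed)
    lam = list(omega)                      # initialize the intensity at omega
    x = [rpois(l, rng) for l in lam]
    lams, xs = [tuple(lam)], [tuple(x)]
    for _ in range(T - 1):
        lam = [omega[i]
               + A[i][0] * lam[0] + A[i][1] * lam[1]   # intensity feedback
               + B[i][0] * x[0] + B[i][1] * x[1]       # observation feedback
               for i in range(2)]
        x = [rpois(l, rng) for l in lam]
        lams.append(tuple(lam))
        xs.append(tuple(x))
    return lams, xs

lams, xs = simulate_maci([0.5, 0.3],
                         A=[[0.3, 0.05], [0.0, 0.25]],
                         B=[[0.2, 0.1], [0.15, 0.2]],
                         T=500)
# every intensity stays strictly positive; A + B is a contraction here,
# so the simulated path is stable
```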

#### *2.3. Quasi-Maximum Likelihood for MACI Models*

In this section, we discuss how inference for MACI/INGARCH models can be carried out. The details for the multivariate case have also been discussed in Doukhan et al. [10]. For these models, we make use of the classical estimation framework and, in particular, quasi-maximum likelihood estimation. The conditional quasi-likelihood for this model reads

$$L(\boldsymbol{\theta}) = \prod\_{t=1}^{T} \prod\_{i=1}^{n} \left( \frac{\exp(-\lambda\_{i,t}(\boldsymbol{\theta})) \lambda\_{i,t}^{x\_{i,t}}(\boldsymbol{\theta})}{x\_{i,t}!} \right), \tag{7}$$

where *θ* are the parameters of interest. It follows that the quasi log-likelihood function is

$$l(\boldsymbol{\theta}) = \sum\_{t=1}^{T} \sum\_{i=1}^{n} (\boldsymbol{x}\_{i,t} \log \lambda\_{i,t}(\boldsymbol{\theta}) - \lambda\_{i,t}(\boldsymbol{\theta})),\tag{8}$$

and the corresponding score function reads

$$\begin{split} S\_T(\boldsymbol{\theta}) &= \sum\_{t=1}^T \sum\_{i=1}^n \left( \frac{\boldsymbol{x}\_{i,t}}{\lambda\_{i,t}} - 1 \right) \frac{\partial \lambda\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ &= \sum\_{t=1}^T \frac{\partial \lambda\_t^T(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} D\_t^{-1}(\boldsymbol{\theta}) (\boldsymbol{X}\_t - \lambda\_t(\boldsymbol{\theta})) \equiv \sum\_{t=1}^T \boldsymbol{s}\_t(\boldsymbol{\theta}), \end{split} \tag{9}$$

where $\partial\lambda\_t/\partial\theta^T$ is an $n \times d$ matrix, with $d \equiv n(1 + 2n)$ being the dimension of the parameter vector $\theta$; $D\_t$ is an $n \times n$ diagonal matrix whose diagonal elements are $\lambda\_{i,t}(\theta)$, $i = 1, 2, \dots, n$; and $X\_t$ consists of the elements $x\_{i,t}$, $i = 1, 2, \dots, n$, $t = 1, 2, \dots, T$. Thus, the recursions for the quasi-maximum likelihood estimation follow

$$\frac{\partial \lambda\_t}{\partial \omega^T} = I\_n + A \frac{\partial \lambda\_{t-1}}{\partial \omega^T},\tag{10}$$

$$\frac{\partial \lambda\_t}{\partial vec^T(A)} = (\lambda\_{t-1}^T \otimes I\_n) + A \frac{\partial \lambda\_{t-1}}{\partial vec^T(A)},\tag{11}$$

$$\frac{\partial \lambda\_t}{\partial vec^T(B)} = (X\_{t-1}^T \otimes I\_n) + A \frac{\partial \lambda\_{t-1}}{\partial vec^T(B)}.\tag{12}$$
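To make the estimation objective concrete, here is a small sketch (plain Python; the initialization of the intensity at ω and the parameter values are our own illustrative choices) that evaluates the quasi log-likelihood (8) along the intensity recursion (5) with p = q = 1. In the degenerate case A = B = 0 the intensity stays fixed at ω, so the value can be checked by hand.

```python
from math import log

def quasi_loglik(omega, A, B, X):
    """Quasi log-likelihood of Equation (8) for a bivariate MACI(1,1),
    with the intensity recursion (5) initialized at omega.
    X: list of (x1, x2) count observations; A, B: 2x2 coefficient lists."""
    lam = list(omega)
    ll = 0.0
    for x in X:
        # per-observation contribution: sum_i (x_i log(lam_i) - lam_i)
        ll += sum(x[i] * log(lam[i]) - lam[i] for i in range(2))
        # advance the intensity recursion for the next period
        lam = [omega[i]
               + A[i][0] * lam[0] + A[i][1] * lam[1]
               + B[i][0] * x[0] + B[i][1] * x[1]
               for i in range(2)]
    return ll

# hand-checkable special case: A = B = 0 keeps the intensity at omega
zero = [[0.0, 0.0], [0.0, 0.0]]
ll = quasi_loglik([2.0, 1.0], zero, zero, [(1, 0), (2, 1), (0, 3)])
# ll = (1 + 2 + 0) * log(2) - 3 * 2 + (0 + 1 + 3) * log(1) - 3 * 1 = 3 log 2 - 9
```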

Finally, the Hessian matrix and the conditional information matrix are, respectively,

$$H\_T(\boldsymbol{\theta}) = \sum\_{t=1}^T \sum\_{i=1}^n \frac{x\_{i,t}}{\lambda\_{i,t}^2(\boldsymbol{\theta})} \frac{\partial \lambda\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \frac{\partial \lambda\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}^T} - \sum\_{t=1}^T \sum\_{i=1}^n \left(\frac{x\_{i,t}}{\lambda\_{i,t}(\boldsymbol{\theta})} - 1\right) \frac{\partial^2 \lambda\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T},\tag{13}$$

$$\mathcal{G}\_{T}(\boldsymbol{\theta}) = \sum\_{t=1}^{T} \frac{\partial \lambda\_{t}^{T}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} D\_{t}^{-1}(\boldsymbol{\theta}) \boldsymbol{\Sigma}\_{t} D\_{t}^{-1}(\boldsymbol{\theta}) \frac{\partial \lambda\_{t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}^{T}}.\tag{14}$$

Further, one can show that $S\_T(\boldsymbol{\theta}) = 0$ has a unique solution, $\hat{\boldsymbol{\theta}}$, which is strongly consistent and asymptotically normal. For further details of these properties, we refer the reader to Doukhan et al. [10]. Note, however, that the theoretical properties of $\hat{\boldsymbol{\theta}}$ are proven under the assumption that the true value $\boldsymbol{\theta}^0$ belongs to the interior of the parameter space **Θ**. Problems certainly arise when the true parameter is close to or on the boundary of the parameter space. Dealing with the theoretical problems of constrained optimization and of parameters near or on the boundary of the parameter space is outside the scope of this paper, and establishing the theory for this case is generally a complicated task. One possible solution is to exploit bootstrap methods for this task; see Hilmer et al. [31] for a comparison of some bootstrap methods related to this sort of problem and a review of other possible approaches.

#### *2.4. Log-Linear Autoregressive Model*

Log-linear models have appeared in the count data literature in recent years [10] and have good potential, since they allow for both positive and negative correlation and avoid the parameter boundary problems from which MACI models suffer. The model is specified as

$$X\_{i,t} \mid \mathcal{F}\_{t-1}^{X,\lambda} \sim \text{Poisson}(\lambda\_{i,t}),\tag{15}$$

$$\nu\_t = \omega + A\nu\_{t-1} + B\log(\mathbf{X}\_{t-1} + \mathbf{1}\_n), \ t \ge 1,\tag{16}$$

where $\mathcal{F}\_{t-1}^{X,\lambda}$ is the $\sigma$-field generated by $\{X\_0, \ldots, X\_{t-1}, \lambda\_0\}$, $\mathbf{1}\_n$ is the *n*-dimensional vector of ones, and $\nu\_t \equiv \log \lambda\_t$. The parameters of this model, $\omega$, $A$, and $B$, do not have to be positive, which makes this model more attractive than MACI.
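A short sketch of the recursion (16) (plain Python; the parameter values are hypothetical) illustrates why no sign restrictions are needed: the intensity λ_t = exp(ν_t) is positive by construction even when entries of A and B are negative.

```python
from math import exp, log

def loglinear_intensity(omega, A, B, X):
    """Iterate nu_t = omega + A nu_{t-1} + B log(X_{t-1} + 1) (Equation (16))
    for a bivariate series and return the implied intensities exp(nu_t)."""
    nu = list(omega)                       # initialize nu at omega
    lams = []
    for x in X:
        lams.append([exp(v) for v in nu])  # lam_t = exp(nu_t) > 0 always
        lx = [log(x[0] + 1.0), log(x[1] + 1.0)]
        nu = [omega[i]
              + A[i][0] * nu[0] + A[i][1] * nu[1]
              + B[i][0] * lx[0] + B[i][1] * lx[1]
              for i in range(2)]
    return lams

# negative feedback coefficients are admissible, unlike in the MACI case
A = [[0.4, -0.2], [-0.1, 0.3]]
B = [[-0.3, 0.1], [0.2, -0.25]]
lams = loglinear_intensity([0.2, 0.1], A, B, [(0, 5), (2, 1), (7, 0), (1, 1)])
# every entry of lams is strictly positive despite the negative coefficients
```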

#### *2.5. Quasi-Maximum Likelihood for Log-Linear Models*

Inference in log-linear models is very similar to the quasi-maximum likelihood approach derived for MACI models in Section 2.3. Only minor adjustments have to be made in the corresponding recursions [10]. In particular, the score function for the log-linear model reads

$$S\_T(\boldsymbol{\theta}) = \sum\_{t=1}^T \sum\_{i=1}^n (\mathbf{x}\_{i,t} - \exp(\boldsymbol{\nu}\_{i,t}(\boldsymbol{\theta}))) \frac{\partial \boldsymbol{\nu}\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \sum\_{t=1}^T \frac{\partial \boldsymbol{\nu}\_t^T(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} (\mathbf{X}\_t - \exp(\boldsymbol{\nu}\_t(\boldsymbol{\theta}))), \tag{17}$$

the Hessian matrix is

$$H\_T(\boldsymbol{\theta}) = \sum\_{t=1}^T \sum\_{i=1}^n \exp(\nu\_{i,t}(\boldsymbol{\theta})) \frac{\partial \nu\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \frac{\partial \nu\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}^T} - \sum\_{t=1}^T \sum\_{i=1}^n (\mathbf{x}\_{i,t} - \exp(\nu\_{i,t}(\boldsymbol{\theta}))) \frac{\partial^2 \nu\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}, \tag{18}$$

and the conditional information matrix for the log-linear model reads

$$\mathcal{G}\_T(\boldsymbol{\theta}) = \sum\_{t=1}^T \sum\_{i=1}^n \exp\left(\boldsymbol{\nu}\_{i,t}(\boldsymbol{\theta})\right) \frac{\partial \boldsymbol{\nu}\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \frac{\partial \boldsymbol{\nu}\_{i,t}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}^T}. \tag{19}$$
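To make the score (17) concrete, here is a univariate special case (n = 1, differentiating with respect to ω only, with fixed hypothetical values of a and b, and the recursion initialized at ν₁ = ω), together with a central finite-difference check of ∂ℓ/∂ω, where ℓ(θ) = Σₜ (xₜ νₜ − exp(νₜ)):

```python
from math import exp, log

def loglik_and_score(omega, a, b, xs):
    """Univariate log-linear model: nu_{t+1} = omega + a nu_t + b log(x_t + 1).
    Returns the log-likelihood and the score dl/domega from Equation (17),
    propagating dnu_t/domega = 1 + a dnu_{t-1}/domega."""
    nu, dnu = omega, 1.0          # nu_1 = omega, so dnu_1/domega = 1
    ll, score = 0.0, 0.0
    for x in xs:
        ll += x * nu - exp(nu)
        score += (x - exp(nu)) * dnu     # Equation (17), scalar case
        nu = omega + a * nu + b * log(x + 1.0)
        dnu = 1.0 + a * dnu
    return ll, score

xs = [3, 1, 0, 4, 2]
ll, score = loglik_and_score(0.5, 0.4, 0.2, xs)
h = 1e-6
fd = (loglik_and_score(0.5 + h, 0.4, 0.2, xs)[0]
      - loglik_and_score(0.5 - h, 0.4, 0.2, xs)[0]) / (2 * h)
# the analytic score agrees with the central finite difference
```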

Doukhan et al. [10] prove theoretical properties of this model. In particular, they show that there exists a unique solution $\hat{\boldsymbol{\theta}}$ which is strongly consistent and asymptotically normal. The authors also show that the condition $\sum\_{j=0}^{\infty} \| A^j B \|\_2 < 1$ guarantees both stationarity and weak dependence.

#### *2.6. Multivariate Poisson Distribution*

To allow for contemporaneous correlation, we need to use the trivariate reduction technique discussed before. We consider the bivariate case as an example: assume there are three independent random variables $Y\_1$, $Y\_2$, $Y\_3$ with positive means $\lambda\_1$, $\lambda\_2$, $\lambda\_3$, respectively. Define the random variables $X\_1 = Y\_1 + Y\_3$ and $X\_2 = Y\_2 + Y\_3$. The new random variables have means $\lambda\_1 + \lambda\_3$ and $\lambda\_2 + \lambda\_3$, where $\lambda\_3$ also corresponds to the covariance between $X\_1$ and $X\_2$. The covariance is clearly restricted to be positive, while the correlation will lie between 0 and $\min\left\{\sqrt{\frac{\lambda\_1+\lambda\_3}{\lambda\_2+\lambda\_3}}, \sqrt{\frac{\lambda\_2+\lambda\_3}{\lambda\_1+\lambda\_3}}\right\}$. Thereby, the joint pmf of interest, alternative to what we have in Equation (4), becomes

$$\begin{split} P(\mathbf{X}\_{1t} = \mathbf{x}\_{1t}, \mathbf{X}\_{2t} = \mathbf{x}\_{2t} \mid \mathcal{F}\_{t-1}) &= e^{-(\lambda\_1 + \lambda\_2 + \lambda\_3)} \frac{\lambda\_1^{x\_1} \lambda\_2^{x\_2}}{\mathbf{x}\_1! \mathbf{x}\_2!} \\ &\times \sum\_{i=0}^{\min(\mathbf{x}\_1, \mathbf{x}\_2)} \binom{\mathbf{x}\_1}{i} \binom{\mathbf{x}\_2}{i} i! \left(\frac{\lambda\_3}{\lambda\_1 \lambda\_2}\right)^i. \end{split} \tag{20}$$
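The pmf (20) is straightforward to implement and check numerically; the sketch below (plain Python, with illustrative parameter values) verifies that the probability mass sums to one and that the covariance of the implied pair equals λ₃, up to truncation error:

```python
from math import exp, factorial, comb

def bivpois_pmf(x1, x2, lam1, lam2, lam3):
    """Bivariate Poisson pmf of Equation (20) (trivariate reduction)."""
    base = exp(-(lam1 + lam2 + lam3)) * lam1 ** x1 * lam2 ** x2 \
           / (factorial(x1) * factorial(x2))
    s = sum(comb(x1, i) * comb(x2, i) * factorial(i)
            * (lam3 / (lam1 * lam2)) ** i
            for i in range(min(x1, x2) + 1))
    return base * s

lam1, lam2, lam3 = 1.2, 0.8, 0.5
grid = [(i, j) for i in range(40) for j in range(40)]
total = sum(bivpois_pmf(i, j, lam1, lam2, lam3) for i, j in grid)
m1 = sum(i * bivpois_pmf(i, j, lam1, lam2, lam3) for i, j in grid)
m2 = sum(j * bivpois_pmf(i, j, lam1, lam2, lam3) for i, j in grid)
cov = sum(i * j * bivpois_pmf(i, j, lam1, lam2, lam3) for i, j in grid) - m1 * m2
# total ~ 1, m1 ~ lam1 + lam3, m2 ~ lam2 + lam3, cov ~ lam3 (truncation at 40)
```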

Extending this approach to contemporaneous correlation in higher dimensions is not trivial. Suppose that we would like to model *n* correlated Poisson random variables $X\_1, \ldots, X\_n$. By introducing independent random variables $Y\_i \sim \text{Poisson}(\lambda\_i)$, $i = 1, \ldots, n$, and $Y\_0 \sim \text{Poisson}(\lambda\_0)$, and extending the argument of the trivariate reduction, we can define $X\_1 = Y\_1 + Y\_0$, $X\_2 = Y\_2 + Y\_0, \ldots, X\_n = Y\_n + Y\_0$. The joint pmf is

$$\begin{split} P(X\_1 = x\_1, X\_2 = x\_2, \dots, X\_n = x\_n) &= \exp\left(-\lambda\_0 - \sum\_{i=1}^n \lambda\_i\right) \prod\_{i=1}^n \frac{\lambda\_i^{x\_i}}{x\_i!} \\ &\times \sum\_{i=0}^m \prod\_{j=1}^n \binom{x\_j}{i} (i!)^{n-1} \left(\frac{\lambda\_0}{\prod\_{j=1}^n \lambda\_j}\right)^i, \end{split} \tag{21}$$

where $m = \min(x\_1, x\_2, \ldots, x\_n)$. This approach assumes that the covariance is the same for all pairs of Poisson random variables, which is very restrictive. Karlis and Meligkotsidou [9] consider a case with a richer covariance structure, which we discuss next. For simplicity, assume we want to model a trivariate time series of counts $Y\_1$, $Y\_2$, $Y\_3$. As before, let $X\_i$ and $X\_{ij}$ denote univariate Poisson random variables, i.e., $X\_i \sim \text{Poisson}(\lambda\_i)$ and $X\_{ij} \sim \text{Poisson}(\lambda\_{ij})$ with $i, j \in \{1, 2, 3\}$, $i < j$. Then the random variables $Y\_i$ with $i \in \{1, 2, 3\}$ are defined in the following way

$$\begin{aligned} Y\_1 &= X\_1 + X\_{12} + X\_{13}, \\ Y\_2 &= X\_2 + X\_{12} + X\_{23}, \\ Y\_3 &= X\_3 + X\_{13} + X\_{23}. \end{aligned} \tag{22}$$

Thus, $Y\_i \sim \text{Poisson}(\lambda\_i + \lambda\_{ij} + \lambda\_{ik})$, where $i, j, k \in \{1, 2, 3\}$ are distinct. Finally, these random variables follow a multivariate Poisson distribution with $\lambda = (\lambda\_1, \lambda\_2, \lambda\_3, \lambda\_{12}, \lambda\_{13}, \lambda\_{23})$, and hence with mean vector $A\lambda = (\lambda\_1 + \lambda\_{12} + \lambda\_{13}, \lambda\_2 + \lambda\_{12} + \lambda\_{23}, \lambda\_3 + \lambda\_{13} + \lambda\_{23})$. The variance-covariance matrix of this distribution is given by

$$A\Sigma A' = \begin{pmatrix} \lambda\_1 + \lambda\_{12} + \lambda\_{13} & \lambda\_{12} & \lambda\_{13} \\ \lambda\_{12} & \lambda\_2 + \lambda\_{12} + \lambda\_{23} & \lambda\_{23} \\ \lambda\_{13} & \lambda\_{23} & \lambda\_3 + \lambda\_{13} + \lambda\_{23} \end{pmatrix} \tag{23}$$
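The construction in Equation (22) is easy to simulate, which also gives a quick numerical check of the moments above. The following sketch (our minimal illustration with hypothetical rates, not part of the original analysis) draws the six independent Poisson components and verifies that the empirical covariance matrix matches Equation (23):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # Monte Carlo sample size

# Hypothetical rates lambda_i and lambda_ij, for illustration only.
lam = {"1": 2.0, "2": 1.5, "3": 1.0, "12": 0.5, "13": 0.8, "23": 0.3}
X = {k: rng.poisson(v, size=n) for k, v in lam.items()}

# Equation (22): each shared component X_ij enters both Y_i and Y_j,
# inducing Cov(Y_i, Y_j) = lambda_ij as in Equation (23).
Y1 = X["1"] + X["12"] + X["13"]
Y2 = X["2"] + X["12"] + X["23"]
Y3 = X["3"] + X["13"] + X["23"]

cov = np.cov(np.vstack([Y1, Y2, Y3]))
# Diagonal approx. lambda_i + lambda_ij + lambda_ik; off-diagonal approx. lambda_ij.
```

Since the *λij* are Poisson rates, only non-negative off-diagonal entries are attainable, which illustrates the restriction to positive dependence.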

It is clear from the above examples that modeling time series of counts with the multivariate Poisson distribution in higher dimensions is restrictive and cumbersome. It is restrictive since it allows only for positive dependency in the data, which can be unreasonable for real-world applications. It is cumbersome since the method is only computationally tractable for low-count data; see Equation (21), in which the number of terms in the sum depends on the number of observed counts. Methods such as expectation maximization can be applied in this case, but they are neither trivial nor stable for high counts. Moreover, in this case, incorporating the multivariate Poisson distribution into MACI or log-linear models also affects computational speed substantially, and these models lose their attractiveness over more complex models such as the nonlinear state-space models of the next section.

#### **3. Parameter Driven Model: Nonlinear State-Space Model**

The advantage of parameter driven models is the clear interpretability of the model parameters and a high degree of flexibility. The model can easily incorporate different distributions and extends easily to the multivariate framework. Moreover, in the Bayesian framework, we have coherent inferential tools derived from the posterior distributions of the parameters, such as highest posterior density intervals. These models also take into account uncertainty about the parameters, which is incorporated into predictions. The disadvantage of this approach is challenging, computationally intensive estimation procedures. Hence, even though estimation methodologies can in theory be extended to any dimension, in practice this is not feasible due to time constraints. In this paper, we estimate the parameter-driven model for multivariate count data using Sequential Monte Carlo and particle Markov Chain Monte Carlo. These methods became popular with the availability of more computational power. They are restricted in some ways, and we will discuss these restrictions in the next subsections after introducing the nonlinear state-space model (SSM), which we will compare to the observation driven models.

#### *3.1. Multivariate SSM*

A state-space model is usually specified by an observation equation and a state equation. The state equation describes a latent process, say *ht*, which drives the dynamics of the observations *yt*. In the multivariate SSM for count data below, the dependence between the observations and the state is nonlinear

$$X\_{it} \sim \text{Poisson}(\lambda\_{it}), \quad i = 1, \dots, n, \; t = 1, 2, \dots, T \tag{24}$$

$$
\lambda\_t = \beta \exp(h\_t) \tag{25}
$$

$$\mathbf{h}\_{t} = \sum\_{i=1}^{p} \Phi\_{i} \mathbf{h}\_{t-i} + \eta\_{t}, \qquad \boldsymbol{\Sigma}\_{\eta} = \begin{pmatrix} \sigma\_{\eta\_{1}}^{2} & \rho\_{\eta\_{12}} & \dots & \rho\_{\eta\_{1n}} \\ \rho\_{\eta\_{21}} & \sigma\_{\eta\_{2}}^{2} & \dots & \rho\_{\eta\_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \rho\_{\eta\_{n1}} & \dots & \dots & \sigma\_{\eta\_{n}}^{2} \end{pmatrix} \tag{26}$$

where *η<sup>t</sup>* ∼ *N*(0, Σ*η*). Equation (24) shows that the observations have a Poisson distribution with mean *λ<sup>t</sup>* defined in Equation (25), and *λ<sup>t</sup>* depends nonlinearly on the latent process *ht* defined in Equation (26). Note that the latent process is a VAR(*p*) process, and hence the corresponding theory applies. In particular, the stationarity condition is that the solutions of Equation (27) must lie inside the unit circle

$$\left| \, \lambda^p I\_n - \lambda^{p-1} \Phi\_1 - \dots - \Phi\_p \right| = 0. \tag{27}$$
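In practice, this condition is easiest to check numerically via the companion form of the VAR(*p*): the process is stationary when all eigenvalues of the companion matrix lie inside the unit circle. A small sketch (the coefficient values below are illustrative, not from the paper):

```python
import numpy as np

def is_stationary(Phis):
    """Check stationarity of a VAR(p) given coefficient matrices Phi_1, ..., Phi_p.

    Stacks the coefficients into the companion matrix; the VAR is stationary
    iff all companion eigenvalues have modulus strictly less than one.
    """
    p = len(Phis)
    n = Phis[0].shape[0]
    companion = np.zeros((n * p, n * p))
    companion[:n, :] = np.hstack(Phis)        # first block row: Phi_1 ... Phi_p
    companion[n:, :-n] = np.eye(n * (p - 1))  # shifted identity blocks
    return np.max(np.abs(np.linalg.eigvals(companion))) < 1.0

# Example: a bivariate VAR(1) with moderate persistence is stationary,
# while a coefficient larger than one on the diagonal breaks stationarity.
print(is_stationary([np.array([[0.5, 0.0], [0.3, 0.4]])]))  # True
print(is_stationary([np.array([[1.1, 0.0], [0.3, 0.4]])]))  # False
```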

The dependence structure between counts is modeled through the dependence in the latent process. Conditioned on the latent process {*h<sub>t</sub>*}<sup>*T*</sup><sub>*t*=1</sub>, the observations {*X<sub>t</sub>*}<sup>*T*</sup><sub>*t*=1</sub> are independent. Furthermore, since the latent process of the model is a VAR(*p*), we can account for various dependence structures: positive and negative contemporaneous correlation, and positive and negative Granger-causal feedback.

These models are challenging to estimate, and an assumption of *p* = 1 can simplify the inference. (For extending the model to *p* > 1 we advise the reader to consider using sparse priors, such as Minnesota prior, spike and slab or horseshoe prior.) Bivariate specification of the nonlinear state-space model with the lag *p* = 1 reads

$$X\_{\rm it} \sim \text{Poisson}(\lambda\_{\rm it}), \ i = 1, 2 \tag{28}$$

$$
\lambda\_{it} = \beta\_i \exp(h\_{it}), \text{ } i = 1, 2. \tag{29}
$$

$$
\begin{pmatrix} h\_{1,t+1} \\ h\_{2,t+1} \end{pmatrix} = \begin{pmatrix} \phi\_{11} & \phi\_{12} \\ \phi\_{21} & \phi\_{22} \end{pmatrix} \begin{pmatrix} h\_{1,t} \\ h\_{2,t} \end{pmatrix} + \begin{pmatrix} \eta\_{1t+1} \\ \eta\_{2t+1} \end{pmatrix}, \qquad \Sigma\_{\eta} = \begin{pmatrix} \sigma\_{\eta\_1}^2 & \rho\_{\eta\_{12}} \\ \rho\_{\eta\_{21}} & \sigma\_{\eta\_2}^2 \end{pmatrix} \tag{30}
$$

The dependence structure between the series is described by the Granger-causal relationships in the latent processes *hit* and the contemporaneous relations incorporated in Σ*η*. For example, we say that *h*2,*<sup>t</sup>* does not Granger-cause *h*1,*<sup>t</sup>* if *φ*<sup>12</sup> = 0. The correlation coefficient *ρ* in this model allows us to model both positive and negative correlation between the counts.
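A generative sketch of the bivariate model (28)-(30) may help fix ideas. The parameter values below are hypothetical, chosen only so that the latent VAR(1) is stationary; *φ*21 ≠ 0 introduces a Granger-causal link and the off-diagonal entry of Σ*η* the contemporaneous correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500

beta = np.array([2.0, 3.0])            # baseline intensities beta_i
Phi = np.array([[0.5, 0.0],
                [0.3, 0.4]])           # phi_21 != 0: h_1 Granger-causes h_2
Sigma_eta = np.array([[0.10, 0.04],
                      [0.04, 0.10]])   # positive contemporaneous correlation

h = np.zeros(2)                        # latent log-intensity state
X = np.empty((T, 2), dtype=int)
for t in range(T):
    lam = beta * np.exp(h)             # Equation (29)
    X[t] = rng.poisson(lam)            # Equation (28)
    h = Phi @ h + rng.multivariate_normal(np.zeros(2), Sigma_eta)  # Equation (30)
```

Flipping the sign of the off-diagonal entry of `Sigma_eta` yields negatively correlated counts, which the multivariate Poisson construction of Section 2 cannot produce.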

#### *3.2. Bayesian Inference in Multivariate SSM*

The estimation of nonlinear state-space models naturally fits into the Bayesian framework. The presence of the unobservable process in the model and the nonlinear dependence of the observations on this unobservable process lead to an intractable likelihood and posterior. For this reason, and due to the nonlinear SSM structure, we use particle Markov Chain Monte Carlo (PMCMC) for the estimation of the posterior distribution of the model parameters, see Andrieu et al. [21]. The method consists of two parts. First, the likelihood is estimated in a sequential manner through a *particle filter*. Second, this estimate is used within an MCMC sampler, in our case the Metropolis-Hastings algorithm. An extensive introduction to nonlinear state-space models and particle filtering can be found, for example, in Särkkä [32].

Recall the Bayes rule on which the inference is based

$$p(h\_{1:T}|\mathbf{x}\_{1:T}) = \frac{p(\mathbf{x}\_{1:T}|h\_{1:T})\pi(h\_{1:T})}{p(\mathbf{x}\_{1:T})},\tag{31}$$

where *π*(*h*1:*T*) is the prior distribution on the latent process defined by the dynamic model, *p*(*x*1:*T*|*h*1:*T*) is the likelihood of the observations, and *p*(*x*1:*T*) is the normalization constant, which is ignored during the inference. Thus, we use the Bayes rule in proportionality terms

$$p(h\_{1:T}|x\_{1:T}) \propto p(x\_{1:T}|h\_{1:T})\pi(h\_{1:T}).\tag{32}$$

We use particle Metropolis-Hastings to estimate the posterior distribution of the parameters of the model, since neither the likelihood nor the posterior is available in closed form. Algorithm 1 presents one iteration of the Metropolis-Hastings algorithm. At every iteration of the algorithm we make a new proposal *θ<sup>c</sup><sub>i</sub>* for the parameter vector using a proposal mechanism *q*(·|*θ*<sup>(*i*)</sup>). Then we accept the proposed candidate *θ<sup>c</sup><sub>i</sub>* with probability *ρ*, which depends on the target distribution *p*(*θ*, *h*1:*T*|*x*1:*T*) and the proposal distribution *q*(·). How well we manage to explore the posterior distribution depends on the acceptance rate of the algorithm. An acceptance rate that is too high often indicates proposal steps that are too small, and vice versa; either case slows down the convergence of the Markov chain. General advice for the optimal performance of the algorithm is an acceptance rate around 0.234 [33].

#### **Algorithm 1** Particle Metropolis-Hastings Algorithm

1: Given *θ*<sup>(*i*)</sup>,

2: Generate *θ<sup>c</sup><sub>i</sub>* ∼ *q*(·|*θ*<sup>(*i*)</sup>),

3: Take

$$\theta^{(i+1)} = \begin{cases} \theta\_i^c & \text{with probability } \rho(\theta^{(i)}, \theta\_i^c), \\ \theta^{(i)} & \text{with probability } 1 - \rho(\theta^{(i)}, \theta\_i^c), \end{cases}$$

where

$$\rho(\theta^{(i)}, \theta\_i^c) = \min\left\{ \frac{p\_{\theta\_i^c}(\mathbf{x}\_{1:T})}{p\_{\theta^{(i)}}(\mathbf{x}\_{1:T})} \cdot \frac{\pi(\theta\_i^c)}{\pi(\theta^{(i)})} \cdot \frac{q(\theta^{(i)}|\theta\_i^c)}{q(\theta\_i^c|\theta^{(i)})}, \, 1 \right\}$$

Using Algorithm 1 we obtain samples from the otherwise intractable distribution *p*(*θ*, *h*1:*T*|*x*1:*T*). Note that the likelihood terms *p*<sub>*θ*</sub>(*x*1:*T*) entering the acceptance probability *ρ* are themselves intractable. Thus, in practice we substitute them with estimates *p*ˆ<sub>*θ*</sub>(*x*1:*T*) obtained with Sequential Monte Carlo.
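An iteration of Algorithm 1 is straightforward to code once a likelihood estimator is available. The sketch below is our illustration, not the authors' implementation: it assumes a Gaussian random-walk proposal, for which the *q*-ratio cancels, works on the log scale for numerical stability, and takes `loglik_hat` as a stand-in for the SMC estimate of log *p*<sub>*θ*</sub>(*x*1:*T*):

```python
import numpy as np

def particle_mh(loglik_hat, log_prior, theta0, step, M=2000, rng=None):
    """Particle Metropolis-Hastings with a symmetric Gaussian random-walk proposal.

    loglik_hat(theta) -- (e.g., SMC-based) estimate of log p(x | theta)
    log_prior(theta)  -- log prior density pi(theta)
    """
    rng = np.random.default_rng(rng)
    theta = np.asarray(theta0, dtype=float)
    ll = loglik_hat(theta)
    draws, accepted = [], 0
    for _ in range(M):
        theta_c = theta + step * rng.normal(size=theta.shape)  # proposal q(.|theta)
        ll_c = loglik_hat(theta_c)
        # log acceptance ratio rho; the symmetric q-terms cancel
        log_rho = (ll_c + log_prior(theta_c)) - (ll + log_prior(theta))
        if np.log(rng.uniform()) < log_rho:
            theta, ll = theta_c, ll_c
            accepted += 1
        draws.append(theta.copy())
    return np.array(draws), accepted / M
```

Monitoring the returned acceptance rate and tuning `step` toward roughly 0.234 follows the guideline cited above.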

We further discuss how *p*<sub>*θ*</sub>(*x*1:*T*) can be estimated.

#### *3.3. Estimation of the Likelihood with SMC*

Sampling from the posterior distribution with algorithms such as Metropolis-Hastings requires evaluating the likelihood. In the case of nonlinear state-space models, this evaluation is not straightforward since the likelihood is a high-dimensional integral

$$\begin{split} L(\mathbf{x}\_{1:T}) &= \int p(\mathbf{x}\_{1:T}, h\_{1:T}) dh\_{1:T} = \int p(\mathbf{x}\_{1:T} | h\_{1:T}) p(h\_{1:T}) dh\_{1:T} \\ &= \int p(\mathbf{x}\_{1} | h\_{1}) p(h\_{1}) \prod\_{t=2}^{T} p(\mathbf{x}\_{t} | h\_{t}) p(h\_{t} | h\_{t-1}) dh\_{1} \cdots dh\_{T} \end{split} \tag{33}$$

and this likelihood is not analytically tractable. Instead of relying on an analytical result, the integral in Equation (33) can be approximated using Sequential Monte Carlo methods, also known as particle filters. This estimate of the likelihood is then used in Algorithm 1 as *p*ˆ*<sup>θ</sup>* (*x*1:*T*). Algorithm 2 represents the simple version of a particle filter that we use in this paper. The algorithm consists of three main steps: prediction, updating and resampling. In the prediction step we sample *N* particles according to the assumed dynamics of the latent process *p*(*ht*|*ht*−1). Then we weight each particle according to the distribution of the data given the latent state *p*(*xt*|*ht*). Finally, in the resampling step we resample the particles according to these weights. The resampling step is meant to solve the well-known problem of particle degeneracy: without resampling, we would end up with only a few particles carrying high weights over time.

#### **Algorithm 2** Bootstrap Particle Filter with resampling

1: Draw a new point *h*<sup>(*i*)</sup><sub>*t*</sub> for each point in the sample set {*h*<sup>(*i*)</sup><sub>*t*−1</sub> : *i* = 1, ... , *N*} from the dynamic model:

$$h\_t^{(i)} \sim p(h\_t | h\_{t-1}^{(i)}), \ i = 1, \ldots, N.$$

2: Calculate the weights

$$
\omega\_t^{(i)} \propto p(\mathbf{x}\_t | h\_t^{(i)}), \quad i = 1, \dots, N,
$$

and normalize them to sum to unity.

3: Compute the estimate of *p*(*xt*|*x*1:*t*−1, *θ*) as the average of the unnormalized weights,

$$\hat{p}(\mathbf{x}\_t | \mathbf{x}\_{1:t-1}, \boldsymbol{\theta}) = \frac{1}{N} \sum\_{i=1}^{N} p(\mathbf{x}\_t | h\_t^{(i)}).$$

4: Perform resampling: draw *N* particles with replacement from {*h*<sup>(*i*)</sup><sub>*t*</sub> : *i* = 1, ... , *N*}, with probabilities given by the normalized weights.
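Algorithm 2 can be sketched compactly for a univariate version of the Poisson SSM. This is a minimal illustration, assuming the model *x<sub>t</sub>* ∼ Poisson(*β* exp(*h<sub>t</sub>*)), *h<sub>t</sub>* = *φh*<sub>*t*−1</sub> + *η<sub>t</sub>*; it returns the log-likelihood estimate that feeds into Algorithm 1:

```python
import math
import numpy as np

def bootstrap_filter_loglik(x, beta, phi, sigma_eta, N=1000, rng=None):
    """Bootstrap particle filter estimate of log p(x_{1:T} | theta) for the
    univariate Poisson SSM: x_t ~ Poisson(beta*exp(h_t)), h_t = phi*h_{t-1} + eta_t."""
    rng = np.random.default_rng(rng)
    # initialize particles from the stationary distribution of the AR(1) state
    h = rng.normal(0.0, sigma_eta / np.sqrt(1.0 - phi ** 2), size=N)
    loglik = 0.0
    for xt in x:
        # 1. prediction: propagate particles through p(h_t | h_{t-1})
        h = phi * h + rng.normal(0.0, sigma_eta, size=N)
        # 2. update: unnormalized log-weights log p(x_t | h_t) (Poisson log-pmf)
        lam = beta * np.exp(h)
        logw = xt * np.log(lam) - lam - math.lgamma(xt + 1)
        # average unnormalized weight estimates p(x_t | x_{1:t-1}, theta);
        # the max is subtracted for numerical stability (log-sum-exp trick)
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())
        # 3. resampling: draw N particles according to the normalized weights
        h = h[rng.choice(N, size=N, p=w / w.sum())]
    return loglik
```

Note that the predictive estimate uses the weights before normalization, while the resampling step uses the normalized ones, mirroring steps 2-4 of the algorithm.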


The particle filter provides us with the sequence of filtering distributions *p*(*h<sub>t</sub>*|*x*<sub>1:*t*</sub>); however, due to the particle degeneracy problem discussed previously, sampling from *p*(*h*1:*T*|*x*1:*T*) and approximating *p*(*hk*|*x*1:*T*), *k* = 1, ... , *T*, is inefficient. One possible solution to this problem is the so-called forward-filtering backward-smoothing recursion [34]. The algorithm starts by sampling *h*<sup>∗</sup><sub>*T*</sub> ∼ *p*ˆ(*hT*|*x*1:*T*), and then, backwards for *k* = *T* − 1, *T* − 2, . . . , 1, samples *h*<sup>∗</sup><sub>*k*</sub> ∼ *p*ˆ(*hk*|*h*<sup>∗</sup><sub>*k*+1</sub>, *x*<sub>1:*k*</sub>). Then we can approximate the distribution *p*ˆ(*hk*|*x*1:*T*) as follows

$$\begin{split} \hat{p}(h\_{k}|\mathbf{x}\_{1:T}) &= \sum\_{i=1}^{N} \omega\_{k}^{i} \left[ \sum\_{j=1}^{N} \omega\_{k+1|T}^{j} \frac{f(h\_{k+1}^{\*,j}|h\_{k}^{\*,i})}{\sum\_{l=1}^{N} \omega\_{k}^{l} f(h\_{k+1}^{\*,j}|h\_{k}^{\*,l})} \right] \delta\_{h\_{k}^{\*,i}}(h\_{k}) \\ &= \sum\_{i=1}^{N} \omega\_{k|T}^{i} \, \delta\_{h\_{k}^{\*,i}}(h\_{k}). \end{split} \tag{34}$$

The smoothing comes at a cost of *O*(*NT*) operations to sample a path from *p*(*h*1:*T*|*x*1:*T*) and *O*(*N*<sup>2</sup>*T*) operations to approximate *p*(*hk*|*x*1:*T*). This method works very well, in particular when dealing with large sample sizes. However, its performance comes at the price of a high computational cost. Thereby, it is generally recommended when the sample size of the data is large and Sequential Monte Carlo is hence more likely to suffer from particle degeneracy. There exist other methods that are computationally less expensive [34]; however, in higher dimensions they would be less reliable, and it would be recommendable to use the more expensive methods.

#### *3.4. Forecasting with SSM*

One of the aims of statistical inference is the ability to perform forecasting exercises and thus to handle the uncertainty about the future in the best possible way. In this section, we discuss how the forecasting task fits into the Bayesian framework, and in particular how it can be carried out for the models of our interest.

Recall that we estimated the multivariate SSM for count data in the Bayesian framework. Observing the data *x* = (*x*1, ... , *xT*), we estimated the posterior distribution of the parameters *p*(*θ*|*x*) using particle Markov Chain Monte Carlo methods. Suppose that we are interested in predicting the next *s* observations, that is, *xT*+1, ... , *xT*+*s*. First, note that the prediction equation for the next step reads

$$p(\mathbf{x}\_{t+1}|\mathbf{x}\_{1:t}) = \int p(\mathbf{x}\_{t+1}|h\_{t+1}) p(h\_{t+1}|\mathbf{x}\_{1:t}) dh\_{t+1}. \tag{35}$$

In the framework of particle Markov Chain Monte Carlo, it is natural to exploit the sequential nature of SMC and the fact that we obtain posterior draws in the MCMC part of the algorithm. Thereby, for every parameter vector *θ* drawn in the MCMC, we can propagate the particles obtained at time *T* and base a one-step ahead forecast on them. A similar idea extends to *s*-steps ahead forecasts. In this way, the uncertainty about the parameters is included in the forecasts.
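This propagation can be sketched for a univariate Poisson SSM with *θ* = (*β*, *φ*, *σ<sub>η</sub>*); the helper and its names are our illustration, not the authors' code. Given the filtering particles at time *T* and a set of posterior draws, it returns pooled draws from the *s*-step-ahead predictive distribution:

```python
import numpy as np

def forecast_draws(h_T_particles, theta_draws, s, rng=None):
    """Pooled s-step-ahead predictive draws for a univariate Poisson SSM,
    x_t ~ Poisson(beta*exp(h_t)), h_t = phi*h_{t-1} + eta_t.

    h_T_particles -- filtering particles for h_T
    theta_draws   -- posterior draws (beta, phi, sigma_eta) from PMCMC
    """
    rng = np.random.default_rng(rng)
    preds = []
    for beta, phi, sigma_eta in theta_draws:
        # propagate every particle s steps forward under this posterior draw
        h = np.array(h_T_particles, dtype=float)
        for _ in range(s):
            h = phi * h + rng.normal(0.0, sigma_eta, size=h.shape)
        preds.append(rng.poisson(beta * np.exp(h)))  # draw x_{T+s} | h_{T+s}
    # pooling over the theta draws integrates out the parameter uncertainty
    return np.concatenate(preds)
```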

When forecasting, the most natural but cumbersome approach would be to update the posterior distribution every time we receive a new observation. It would mean that we generate as many MCMC chains as we have forecasting steps. This can be carried out in a straightforward way by re-estimating the posterior distribution each time, or more efficiently by incorporating the update into the SMC framework. However, for large enough samples, adding an extra estimation step into the PMCMC framework should not change the results substantially. Ignoring this update also makes the forecasting performance of the frequentist and Bayesian approaches more comparable: while the SSM naturally fits into the Bayesian paradigm, the log-linear model is usually estimated using frequentist methods (quasi-maximum likelihood in this case). Since our goal is not to compare the two approaches to statistics, this design of the forecasting exercise is fairer.

Figure 2 illustrates the forecasting approach we undertake with the state-space model in a graphical representation. In particular, one can see that we do not re-estimate the posterior distribution every time we receive a data point.

**Figure 2.** Visual representation of forecasting with the state-space model for count data.

#### **4. Model Comparison and Prediction Assessment**

We next summarize the measures that we use to compare the models in Sections 2 and 3. Observation driven models in this comparison are represented by the log-linear autoregressive model. The log-linear autoregressive model is more flexible than the MACI model, since it can account for negative correlation, and it is thus a fairer competitor. The parameter driven approach is the state-space model, where observations are generated from the Poisson distribution and dependency is modeled through a latent process. Note that, for the latter framework, we follow a fully Bayesian approach. We compare these two classes of models by model fit and forecasting performance criteria. The standard measures to assess model fit and forecast accuracy are the Mean Squared Error (MSE) and the Mean Absolute Error (MAE), defined in Equation (36).

$$MSE = \frac{1}{s} \sum\_{i=1}^{s} (\mathbf{x}\_i - \mathbf{x}\_i^\*)^2, \qquad MAE = \frac{1}{s} \sum\_{i=1}^{s} |\mathbf{x}\_i - \mathbf{x}\_i^\*|. \tag{36}$$

In Czado et al. [35] the authors propose comparing forecast performance using some scoring rules. To define scoring rules, let *P* be the predictive distribution and *x* be the counts; then the penalty is defined through *s*(*P*, *x*). Table 1 presents some of the scoring rules one can use for comparing the performance of count data models.

**Table 1.** Scoring rules for assessment of the forecasts. The table summarizes scoring rules that we use to assess forecasting performance of the models under consideration, proposed in Czado et al. [35] for count data.


Note that, in practice, one calculates the mean score

$$S = \frac{1}{n} \sum\_{i=1}^{n} s(P^{(i)}, \mathbf{x}^{(i)}). \tag{37}$$

To compare our results with the conclusions in Zhang et al. [19], we also report the Dawid-Sebastiani (DS) score, which is defined in Equation (38)

$$DSS\_{t,i}(X\_{t,i}) = \left(\frac{X\_{t,i} - \mu\_{t,i}}{\sigma\_{t,i}}\right)^2 + 2\log(\sigma\_{t,i}).\tag{38}$$
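As a concrete illustration of Equations (37) and (38), the snippet below computes the DS score for a hypothetical set of count forecasts, taking the predictive mean and standard deviation from a Poisson predictive distribution (so *σ*² = *μ*); the numbers are ours, not from the paper:

```python
import numpy as np

def dawid_sebastiani(x, mu, sigma):
    """Dawid-Sebastiani score; lower values indicate better forecasts."""
    return ((x - mu) / sigma) ** 2 + 2.0 * np.log(sigma)

# Hypothetical observed counts and Poisson predictive means.
x = np.array([3, 0, 2, 5])
mu = np.array([2.5, 1.0, 2.0, 4.0])
sigma = np.sqrt(mu)  # for a Poisson predictive distribution, variance = mean

# Mean score over the forecast sample, as in Equation (37).
S = dawid_sebastiani(x, mu, sigma).mean()
```

The same pattern applies to the other scoring rules of Table 1: evaluate the penalty per forecast and average over the forecast sample.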

#### **5. Simulation Examples**

In this section, we demonstrate the performance of the models on simulated data. We generate data from various specifications of the SSM and log-linear MACI models and compare the models on forecasting performance, assessed with the six scoring measures discussed in the previous section. The design of the simulation study allows us to assess forecasting performance under both correct model specification and misspecification. Table 2 summarizes three different specifications of the state-space approach for data generation, and Table 3 summarizes the specifications of the log-linear MACI for data generation. Figure 3 illustrates two examples of bivariate time series generated from these models. For each simulation setting, we generate ten datasets with different random seeds and report the averaged results. The state-space model was estimated using the particle Metropolis-Hastings algorithm with *N* = 5000 particles and *M* = 20000 Metropolis-Hastings steps with a warmup period of 5000 steps. The acceptance rate was targeted to be between 25% and 40%.

**Table 2.** True parameters for the data sets generated from the state-space model in the simulation examples. All data generating processes include a one-directional Granger-causal feedback through a non-zero coefficient *φ*<sup>21</sup> and different correlation structures: *SSM*<sup>1</sup> has a positive correlation coefficient *ρ*, *SSM*<sup>2</sup> has a negative correlation coefficient *ρ* and *SSM*<sup>3</sup> has no correlation.


**Table 3.** True parameters for the data sets generated from the log-linear MACI model in the simulated examples.


We assess the forecasting performance of two models for multivariate count data: the state-space model and the log-linear model. Tables 4 and 5 summarize the forecasting performances of the models according to various scoring rules. The rows of the tables correspond to a particular data generating process (see Tables 2 and 3 for the specification details) and the columns to a particular scoring rule (see the scoring rule specifications in Table 1). In particular, Table 4 shows the performance of the state-space model and Table 5 the performance of the log-linear model. The state-space model outperforms the log-linear MACI model when the data are generated from *SSM*<sup>1</sup> (SSM with positive correlation) and *LL*<sup>1</sup> (log-linear model with a negative *a*<sup>11</sup> coefficient). It is particularly interesting that when the data are simulated from *LL*1, the SSM performs best according to all measures despite being a misspecified model. When the data are generated from *SSM*<sup>2</sup> and *SSM*3, the state-space approach performs best based on most measures. This result is expected, as the SSM is the correct model specification for these simulated data. Finally, the log-linear MACI model performs best on data set *LL*2, the case when the model is correctly specified and all coefficients are positive, according to most measures.

**Figure 3.** Examples of the data generated with the state-space and log-linear MACI models. (**a**) Dimension 1 of the bivariate time series generated from SSM2 in Table 2. (**b**) Dimension 2 of the bivariate time series generated from SSM2 in Table 2. (**c**) Dimension 1 of the bivariate time series generated from LL2 in Table 3. (**d**) Dimension 2 of the bivariate time series generated from LL2 in Table 3.

**Table 4.** Scores for the forecasting exercise with the state-space model, according to the definitions in Table 1. The smaller score indicates a better result. DGP column corresponds to the data generating processes in this simulation study, the true parameters are presented in Tables 2 and 3.


**Table 5.** Scores for the forecasting exercise with the log-linear MACI model. The smaller score indicates a better result. DGP column corresponds to the data generating processes in this simulation study, the true parameters are presented in Tables 2 and 3.


#### **6. Empirical Applications**

In this section, we compare the models in two empirical applications—bank failures and transactions data. These data sets exhibit different sample sizes, standard deviations, dispersion and maximum counts. In particular, the bank failure time series reach maxima of 10 and 24 counts, while the transaction data reach up to 67 and 60 counts with comparable mean counts.

#### *6.1. Bank Failures*

Bank failures have been analyzed using a univariate Poisson process [36]. A number of studies have investigated bank failure data for different time periods; see, e.g., Schoenmaker [1] for an analysis of contagion risk in banking. Overall, it is reasonable to expect that bank failures in different countries are driven by similar economic phenomena, and that possible contagion/spillover effects exist between the economies of different countries.

For this application, we analyze a bivariate data set of bank failures in the U.S. and Russia that has not been considered in the literature before. We use the monthly number of bank failures for the period between January 2008 and December 2012 for both countries and apply the bivariate specifications of the models in Sections 2 and 3. Especially because this period includes the global financial crisis, it is important to allow for potential correlation between the numbers of bank failures in the U.S. and Russia using the multivariate count data models. Figure 4 illustrates these time series and Table 6 presents descriptive statistics for this data set.

**Figure 4.** Data for bank failures empirical application. (**a**) Monthly bank failures in Russia January 2008–December 2012. (**b**) Monthly bank failures in the U.S. January 2008–December 2012. (**c**) Scatter plot of data bank failures in subplots (**a**,**b**).


**Table 6.** Descriptive statistics for the bank failures data for the period January 2008 until December 2012 for Russia and U.S.

The estimation results from both models are presented in Tables 7 and 8. Based on the state-space model, the correlation is estimated to be slightly negative, and 0 is included in the highest posterior density interval for this parameter. Although the log-linear MACI model estimates the correlation coefficient to be positive, it provides a large confidence interval for this parameter, which also includes 0. Thus, for this relatively small data set, we do not find an indication of correlated bank failures with either model. We also note that some confidence intervals in Table 8 include the point 0. As discussed in Section 2, applying observation driven models with positivity constraints would be problematic for these data, especially in terms of the calculation of confidence intervals.

**Table 7.** Posterior moments of the parameters of the state-space model for bank failures.


**Table 8.** Parameter estimates of the log-linear MACI model for bank failures.


We next compare the models in terms of their forecast performance. For this comparison, we take a sample size of *T* = 55 and make five-steps-ahead predictions using the log-linear model and the state-space approach. Table 9 presents the scores for this forecasting exercise. Based on all scores except the ranked probability score (rps), the state-space model outperforms the log-linear model in terms of forecasting. Based on the simulation results in Section 5, we conjecture the following: the better performance of the state-space model can be due to this model being close to the true data generating process, or due to its property of capturing data characteristics well even when misspecified.


**Table 9.** Scores for the forecasting exercise with bank failure data. This table shows scoring rules for the forecasting exercise in the bivariate model for bank failure data.

#### *6.2. Transactions*

In this empirical application, we analyze the number of transactions in 30 s intervals for Deutsche Bank AG and Bayer AG (the data were obtained from FactSet, for the time period from 3 August 2015 09:05:30 until 3 August 2015 12:25:00 for the training data). We expect such transactions to be correlated due to their dependence on the time of the day and on market conditions. The sample size in this application is *T* = 400, which is significantly larger than the sample size in the bank failures application. The summary statistics for this data set are provided in Table 10 and Figure 5 illustrates these time series. Both time series have fat tails with a few very high values, concentrated around observations 100 and 1 for Deutsche Bank and Bayer AG, respectively.

**Figure 5.** Data for the transactions empirical application. (**a**) Transactions in 30 s intervals for Deutsche Bank AG. (**b**) Transactions in 30 s intervals for Bayer AG. (**c**) Scatter plot of the transactions in (**a**,**b**).

**Table 10.** Descriptive statistics for the transactions data.


We apply the bivariate counterparts of the count data models in Sections 2 and 3 to these data and compare the model performances based on 100-steps-ahead forecasts. In Tables 11 and 12 we present the parameter estimates of both models. Both models estimate a positive correlation between these time series; however, in the case of the log-linear MACI model the estimated correlation coefficient is much higher. In addition, the confidence intervals of parameter estimates such as *b*<sup>12</sup> and *b*<sup>22</sup> in Table 12 include the point 0. Thus, non-positive true parameters would be a potential problem if the other observation driven models of Section 2, which impose positivity constraints, were applied to these data.

Table 13 presents the scores of each model in the forecast sample. In this application, the log-linear model performs best according to all scoring rules. Based on the simulation results in Section 5, we conjecture that the log-linear model is potentially closer to the true data generating process than the state-space model. We further analyze the forecasting performances of the models in Figure 6. Particularly in Figure 6b, we observe that the log-linear MACI model better captures the high spikes of the counts and then returns to the original level of the data. The forecast with the state-space model appears too smooth compared to the data points. Thus, the better forecast performance of the log-linear MACI is potentially due to its ability to capture these extreme data values successfully.

The under-performance and over-smoothing of the state-space approach could be mitigated by implementing a different particle filter. For example, one could implement look-ahead particle filters such as [37,38]. The general idea of the look-ahead approaches is that, in the particle filtering algorithm, we make a proposal not just according to the dynamics of the model *p*(*ht*|*ht*−1), but taking the current observation into account, *p*(*ht*|*ht*−1, *yt*), or taking the complete time series into account, *p*(*ht*|*ht*−1, *y*1:*T*), as in [38]. These methods, however, have not been developed for the distributional assumptions we consider in this paper, and further research is needed in this direction.


**Table 11.** Posterior moments of the parameters of the state-space model for transactions.

**Table 12.** Parameter estimates of the log-linear MACI model for transactions.



**Table 13.** Scores for transaction forecasts.

**Figure 6.** Forecasts and true data for the transactions empirical application. (**a**) Transactions in 30 s intervals for Deutsche Bank AG (black circles), forecast with SSM (green), forecast with log-linear MACI (red). (**b**) Transactions in 30 s intervals for Bayer AG (black circles), forecast with SSM (green), forecast with log-linear MACI (red).

#### **7. Discussion**

In this paper, we have reviewed and compared two approaches for modeling multivariate count time series data. One of the challenges that appears in the literature and has not been resolved is modeling the dependency between counts in a flexible way that also allows for feasible estimation. We have discussed multivariate autoregressive conditional intensity (MACI) models, their log-linear alternative, which we refer to as the multivariate log-linear model, and the nonlinear state-space model. Both approaches have advantages and disadvantages. In particular, the nonlinear state-space framework allows for various interpretable dependencies that one cannot easily incorporate into the MACI or log-linear approach. However, these models can be computationally expensive to estimate, in particular in higher dimensions. Challenges in estimation arise from different sources. State-space models naturally fit into the Bayesian framework; however, since both the likelihood and the posterior of the model are analytically intractable, this leads to a computationally expensive procedure. MACI models, on the other hand, are quite restrictive: they restrict the coefficients in the model, as well as the correlation between time series, to be positive. Both assumptions can be unrealistic in many real-world applications. The log-linear model avoids the restriction to positive coefficients by a logarithmic transformation of the data. However, its estimation can be unstable, and good starting points need to be chosen. When the dimension of the model grows, it becomes harder to choose good starting points for the optimization problem. The computational advantage of log-linear and MACI models decreases with an increase in either the dimensionality of the model or the number of counts. This reduction in the computational advantage is due to the usage of the multivariate Poisson distribution, as every pairwise correlation has to be modeled as a separate Poisson random variable.
Moreover, the summation in the specification of the joint distribution runs over the number of counts. Generally, estimation of log-linear models is much faster than that of state-space models. In low dimensions and with small counts, these models do not require much computational power; however, once the counts increase and we deal with higher dimensions, the computations become much more extensive due to the large sums in the multivariate Poisson distribution. Moreover, while running the model on simulated and empirical data, we found that the estimation can be numerically unstable and can depend strongly on the starting values of the estimation procedure. Following the suggestion of Doukhan et al. [10], we first estimate the model for each univariate time series and then use these estimates as starting values in the multivariate estimation. In our experience, however, the problem of numerical instability remains, especially in small samples. Nevertheless, in terms of flexibility, this model is the best competitor to the state-space approach.
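The computational point about the multivariate Poisson distribution can be illustrated in the bivariate case via the common-shock construction X = U + W, Y = V + W with independent U ~ Pois(λ1), V ~ Pois(λ2), W ~ Pois(λ3). The joint pmf then contains a sum whose length grows with min(x, y), so each evaluation becomes more expensive as the counts grow. The following is a minimal sketch; the function name and parameter values are illustrative, not taken from the paper:

```python
import math

def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """P(X=x, Y=y) for X = U + W, Y = V + W with independent
    U ~ Pois(lam1), V ~ Pois(lam2), W ~ Pois(lam3).
    The sum over k runs up to min(x, y), so the cost of a single
    pmf evaluation grows with the magnitude of the counts."""
    total = 0.0
    for k in range(min(x, y) + 1):
        total += (lam1 ** (x - k) / math.factorial(x - k)
                  * lam2 ** (y - k) / math.factorial(y - k)
                  * lam3 ** k / math.factorial(k))
    return math.exp(-(lam1 + lam2 + lam3)) * total
```

The marginal of X under this construction is Poisson with mean λ1 + λ3, which provides a quick sanity check of the implementation.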

We have compared log-linear models and state-space models for count data in terms of forecasting performance on multiple simulated data sets and real data applications. We found that on the simulated data the state-space framework generally outperforms the log-linear model, sometimes even under model misspecification. On the real data sets, the state-space model performs better in the bank failures application, which consists of two time series of bank failures in Russia and the U.S.; there the counts remain relatively low and the data are relatively smooth. The log-linear model performs better in the transactions application, in which we consider two time series of transaction counts in 30-second intervals. The challenge of the transactions application is that there are spikes of counts that deviate strongly from the mean. In this case, we notice that the log-linear model approximates these spikes better. Thus, a possible direction for future research is adapting the multivariate state-space model for count data to capture such spikes better. One way to improve the model in this regard would be to adapt the particle filtering algorithm. We used a bootstrap particle filter, which does not take the observations into account when making a proposal for the particles; taking the current (or all) observations into account in the proposal mechanism can help approximate the spikes in the data. Multiple look-ahead approaches for particle filters have been proposed [37,38], but they have not yet been adapted to count data.
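The blind proposal step of the bootstrap particle filter can be sketched for a toy univariate count state-space model with a latent AR(1) log-intensity, z_t = φ z_{t−1} + σ ε_t and y_t ~ Pois(exp(z_t)). This model, the parameter values, and the function names are illustrative assumptions, not the specification estimated in the paper; the sketch only shows why the proposal ignores the current observation:

```python
import math
import random

def poisson_logpmf(y, lam):
    """Log-pmf of a Poisson(lam) distribution at the count y."""
    return -lam + y * math.log(lam) - math.lgamma(y + 1)

def bootstrap_pf(ys, phi=0.8, sigma=0.3, n_particles=500, seed=1):
    """Bootstrap particle filter for a toy count state-space model:
    z_t = phi * z_{t-1} + sigma * eps_t,  y_t ~ Poisson(exp(z_t)).
    Particles are proposed from the state transition alone, i.e.
    without looking at y_t, which is why sudden spikes in the counts
    are approximated poorly and look-ahead proposals could help."""
    rng = random.Random(seed)
    # initialize from the stationary distribution of the AR(1) state
    particles = [rng.gauss(0.0, sigma / math.sqrt(1 - phi ** 2))
                 for _ in range(n_particles)]
    loglik = 0.0
    filtered_means = []
    for y in ys:
        # blind (bootstrap) proposal: simulate the transition only
        particles = [phi * z + sigma * rng.gauss(0.0, 1.0) for z in particles]
        # reweight by the observation likelihood (log-sum-exp for stability)
        logw = [poisson_logpmf(y, math.exp(z)) for z in particles]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        sw = sum(w)
        loglik += m + math.log(sw / n_particles)
        # filtered mean of the latent log-intensity
        filtered_means.append(sum(wi * zi for wi, zi in zip(w, particles)) / sw)
        # multinomial resampling
        particles = rng.choices(particles, weights=w, k=n_particles)
    return loglik, filtered_means
```

A look-ahead variant would replace the transition-only proposal with one that conditions on y_t (or future observations), concentrating particles where the observation likelihood is large when a spike occurs.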

Finally, both approaches have their drawbacks. In particular, the log-linear model seems to have numerical stability issues, and finding good starting values for the optimization can be a challenge. In the state-space approach, the challenging part is the estimation of the likelihood, which is intractable and must be approximated by sampling from the posterior distribution. Additionally, the state-space model in its current implementation is challenged by possible spikes in the data to a larger degree than the log-linear model.

**Author Contributions:** Conceptualization, Y.S., N.B., and M.E.; methodology, Y.S. and M.E.; formal analysis, Y.S.; investigation, Y.S.; writing—original draft preparation, Y.S., N.B., and M.E.; writing—review and editing, Y.S. and N.B.; visualization, N.B. and Y.S.; supervision, M.E. All authors have read and agreed to the published version of the manuscript.

**Funding:** N.B. is partially supported by an NWO grant VI.Vidi. 195.187.

**Data Availability Statement:** The data for Polio and Asthma cases are publicly available from the *R* library glarma [39]. The car crash counts were previously published in [40]. The earthquakes data are available from https://earthquake.usgs.gov/ (accessed on 14 June 2017). Transactions data can be downloaded from FactSet. The data for bank failures are publicly available at www.banki.ru (Russia) (accessed on 2 June 2021) and www.fdic.gov (U.S.) (accessed on 2 June 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**

