*Article* **The Stochastic Stationary Root Model**

#### **Andreas Hetland**

Department of Economics, University of Copenhagen, 1353 Copenhagen K, Denmark; lxs601@ku.dk; Tel.: +45-20422606

Received: 31 March 2018; Accepted: 13 August 2018; Published: 21 August 2018

**Abstract:** We propose and study the stochastic stationary root (SSR) model. The model resembles the cointegrated VAR model but is novel in that: (i) the stationary relations follow a random coefficient autoregressive process, i.e., exhibit heavy-tailed dynamics, and (ii) the system is observed with measurement error. Unlike the cointegrated VAR model, estimation and inference for the SSR model are complicated by a lack of closed-form expressions for the likelihood function and its derivatives. To overcome this, we introduce particle filter-based approximations of the log-likelihood function, sample score, and observed Information matrix. These enable us to approximate the ML estimator via stochastic approximation and to conduct inference via the approximated observed Information matrix. We conjecture the asymptotic properties of the ML estimator and conduct a simulation study to investigate the validity of the conjecture. Model diagnostics to assess model fit are considered. Finally, we present an empirical application to the 10-year government bond rates in Germany and Greece during the period from January 1999 to February 2018.

**Keywords:** cointegration; particle filtering; random coefficient autoregressive model; state space model; stochastic approximation

**JEL Classification:** C15; C32; C51; C58

#### **1. Introduction**

In this paper, we introduce the multivariate stochastic stationary root (SSR) model. The SSR model is a nonlinear state space model, which resembles the Granger-Johansen representation of the cointegrated vector autoregressive (CVAR) model, see *inter alia* Johansen (1996) and Juselius (2007). Like the CVAR model, the SSR model decomposes a *p*-dimensional observation vector into *r* stationary components and *p* − *r* nonstationary components. However, the roots of the stationary components are allowed to be stochastic; hence the name 'stochastic stationary root'. The stationary and nonstationary dynamics of the model are observed with measurement error, which in this model prohibits closed-form expressions for e.g., the log-likelihood, sample score and observed Information matrix. Likelihood-based estimation and inference therefore call for non-standard methods.

Although the SSR model resembles the CVAR model, it is differentiated by its ability to characterize heavy-tailed dynamics in the stationary component. Heavy-tailed dynamics, and other types of nonlinear dependencies, are not amenable to analysis with the CVAR model, which has prompted work into nonlinear alternatives, see *inter alia* Bohn Nielsen and Rahbek (2014), Kristensen and Rahbek (2013), Kristensen and Rahbek (2010), and Bec et al. (2008). Similarly, cointegration in the state space setting has been considered in terms of the common stochastic trend (CST) model by Chang et al. (2009) as well as the CVAR model with measurement errors by Bohn Nielsen (2016). Additionally, the SSR model is related to the stochastic unit root literature, see *inter alia* Granger and Swanson (1997), Leybourne and McCabe (1996), Lieberman and Phillips (2014), Lieberman and Phillips (2017), McCabe and Tremayne (1995), and McCabe and Smith (1998). Relevant empirical applications where the SSR model could potentially provide a better fit than the CVAR model include, but are not limited to, (i) log-prices of assets that exhibit random walk behavior in the levels and heavy-tailed error-correcting dynamics in the no-arbitrage relations, and (ii) interest rates for which the riskless rate exhibits random walk-type dynamics and the risk premia undergo periods of high levels and high volatility.

The stationary and nonstationary components of the SSR model are treated as unobserved processes, and consequently need to be integrated out in order to compute the log-likelihood function and its derivatives. Due to the nonlinearity of the model, this cannot be accomplished analytically. We appeal to the incomplete data framework and the simulation-based approach known as particle filtering to approximate the log-likelihood function, sample score and observed Information matrix. See *inter alia* Gordon et al. (1993), Doucet et al. (2001), Cappé et al. (2005), and Creal (2012) for an overview of the particle filtering literature. Moreover, we rely on stochastic approximation methods to obtain the maximum likelihood (ML) estimator, see Poyiadjis et al. (2011). Summarizing, the main contributions of this paper are to


It is beyond the scope of this paper to provide a complete proof of the asymptotic properties of the ML estimator. The study of the asymptotic properties of the ML estimator in general state space models, such as the SSR model, is an emerging area of research. Most existing results rely on compactness of the state space, which excludes the SSR model and is generally restrictive. For results in this direction, see e.g., Olsson and Rydén (2008) who derive consistency and asymptotic normality for the ML estimator by discretizing the parameter space. Douc et al. (2011) have shown consistency of the ML estimator without assuming compactness, but the regularity conditions are nonetheless too restrictive to encompass the SSR model. Instead of providing a complete proof of the asymptotic properties of the ML estimator, we conjecture the asymptotic properties of the derivatives of the log-likelihood function. We base the conjecture on known properties of models that are closely related to the SSR model, and corroborate it by a simulation study. Given the conjecture holds, it allows us to establish the asymptotic properties of the ML estimator. We leave proving the conjecture for future work, and focus in this paper on developing methods for approximate frequentist estimation and inference.

The rest of the paper is organized as follows. We introduce the SSR model in Section 2, and study some properties of the process in Section 3. In Section 4 we introduce likelihood-based estimation and inference for the unknown model parameter. In Section 5 we introduce the incomplete data framework. In Section 6 we introduce the particle filter-based approximations to the log-likelihood function, sample score and Information matrix. In Section 7 we propose how to approximate the ML estimator and classic standard errors. In Section 8 we consider model diagnostics. In Section 9 we conduct a simulation study of the asymptotic distribution of the ML estimator. In Section 10 we apply the SSR model to monthly observations of 10-year government bond rates in Germany and Greece from January 1999 to February 2018. We conclude in Section 11. All proofs have been relegated to Appendix B, while Appendix A contains various auxiliary results.

Notation-wise, we adopt the convention that the 'blackboard bold' typeface, e.g., 𝔼, denotes operators, and the 'calligraphy' typeface, e.g., 𝒳, denotes sets. We thus let ℝ and ℕ denote the real and natural numbers, respectively. For any matrix *A*, we denote by |*A*| the determinant, by ‖*A*‖ = √tr(*A*′*A*) the Euclidean norm, and by *ρ*(*A*) the spectral radius. For some positive definite matrix *A*, we let *A*<sup>1/2</sup> denote the lower triangular Cholesky decomposition. For some function *f* : ℝ<sup>*d<sub>z</sub>*</sup> → ℝ<sup>*d<sub>f</sub>*</sup>, let ∂*f*(*z*)/∂*z* denote the derivative of *f*(*z*) with respect to *z*. For some stochastic variable *z* ∈ ℝ<sup>*d<sub>z</sub>*</sup> with Gaussian distribution with mean *μ* and covariance Σ, let *N*(*z*; *μ*, Σ) denote the Gaussian probability density function evaluated at *z*. We let *p*(*z*) denote the probability density of stochastic variable *z* ∈ ℝ<sup>*d<sub>z</sub>*</sup> with respect to the *d<sub>z</sub>*-dimensional Lebesgue measure *m*, while *p*(d*z*) = *p*(*z*) d*m* denotes the corresponding probability measure. Additionally, the letter 'p' is generic notation for probability density functions and measures induced by the model defined in (1)–(3) below. The 'bold' typeface, e.g., ***p***, is generic notation for analytically intractable quantities, in the sense of having no closed-form expression. Finally, we denote a sequence of *n* ∈ ℕ<sup>+</sup> real *d<sub>z</sub>*-dimensional vectors by *z*<sub>1:*n*</sub> := [ *z*<sub>1</sub> ... *z<sub>n</sub>* ]′ ∈ ℝ<sup>*n*×*d<sub>z</sub>*</sup>.

#### **2. The Model**

The structure of the SSR model is similar to the Granger-Johansen representation of the CVAR model, cf. Johansen (1996, chp. 4), but departs from it in two respects. First, the stationary component is a random coefficient autoregressive process, cf. e.g., Feigin and Tweedie (1985), rather than an autoregressive process. Second, the stationary and nonstationary components are observed with measurement error. This makes the SSR model a state space model, whereas the CVAR model is observation-driven. In addition to resembling the CVAR model, the SSR model constitutes an extension of the CST model, cf. Chang et al. (2009). However, while the CST model is a linear Gaussian state space model, the SSR model is a nonlinear Gaussian state space model as it allows the stationary component to be a random coefficient autoregressive process.

Formally, we consider the observable *p*-dimensional discrete time vector process *y<sub>t</sub>*, for *t* = 1, 2, . . . , *T*, given by

$$y\_t = C(y\_0) + B \sum\_{i=1}^t \eta\_i + A \xi\_t + u\_t \tag{1}$$

$$
\xi\_t = \mu + \Phi\_t \xi\_{t-1} + \nu\_t. \tag{2}
$$

for fixed initial values *y*<sub>0</sub> and *ξ*<sub>0</sub>, and with *u<sub>t</sub>*, Φ<sub>*t*</sub> and [*η*′<sub>*t*</sub>, *ν*′<sub>*t*</sub>]′ mutually independent. We define *ε<sub>t</sub>* := ∑<sub>*i*=1</sub><sup>*t*</sup> *η<sub>i</sub>* with *ε*<sub>0</sub> = 0<sub>*p*−*r*</sub>. The sequences *ε*<sub>1:*T*</sub> and *ξ*<sub>1:*T*</sub> are unobserved and take values *ε<sub>t</sub>* ∈ ℝ<sup>*p*−*r*</sup> and *ξ<sub>t</sub>* ∈ ℝ<sup>*r*</sup> for 0 < *r* < *p*. Additionally, the matrices are of dimensions *A* ∈ ℝ<sup>*p*×*r*</sup> and *B* ∈ ℝ<sup>*p*×(*p*−*r*)</sup>, with [*A B*] ∈ ℝ<sup>*p*×*p*</sup> invertible. Let the random coefficient, Φ<sub>*t*</sub>, be i.i.d. Gaussian,

$$\text{vec}(\Phi\_t) \sim N(\text{vec}(\Phi), \Omega\_\Phi) \,, \tag{3}$$

with Ω<sub>Φ</sub> a positive definite covariance matrix. Let the observation error be i.i.d. Gaussian, such that *u<sub>t</sub>* ∼ *N*(0, Ω<sub>*u*</sub>) with Ω<sub>*u*</sub> a positive definite matrix, and let the innovations *η<sub>t</sub>* and *ν<sub>t</sub>* be jointly Gaussian such that *η<sub>t</sub>* ∼ *N*(0, Ω<sub>*η*</sub>) and *ν<sub>t</sub>* ∼ *N*(0, Ω<sub>*ν*</sub>) with cross-covariance ℂov[*η<sub>t</sub>*, *ν<sub>t</sub>*] = Ω<sub>*η*,*ν*</sub>, such that the joint covariance matrix,

$$
\Lambda := \left[ \begin{array}{cc} \Omega\_{\eta} & \Omega\_{\eta,\nu} \\ \Omega'\_{\eta,\nu} & \Omega\_{\nu} \end{array} \right] , \tag{4}
$$

is positive definite. Let all the introduced matrices be of appropriate dimensions and full rank. Furthermore, we introduce the orthogonal complements to *B* and *A*, which we denote *b* ∈ ℝ<sup>*p*×*r*</sup> and *a* ∈ ℝ<sup>*p*×(*p*−*r*)</sup>, respectively, such that *b*′*B* = 0 and *a*′*A* = 0 with *b* and *a* of full column rank. Finally, we let *C*(*y*<sub>0</sub>) := *B*(*a*′*B*)<sup>−1</sup>*a*′*y*<sub>0</sub>.
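
To make the data generating process in (1)–(4) concrete, the following is a minimal Python sketch of a simulator. It is an illustration under stated assumptions and not the implementation used in the paper: the function name `simulate_ssr`, its argument names and the use of NumPy are hypothetical choices, and the orthogonal complement *a* is taken from a singular value decomposition, which is only one of many valid normalizations.

```python
import numpy as np

def simulate_ssr(T, A, B, mu, Phi, Omega_Phi, Lam, Omega_u, y0, xi0, rng):
    """Simulate y_1, ..., y_T from equations (1)-(3); returns (y, eps, xi)."""
    p, r = A.shape
    q = p - r  # number of common stochastic trends
    # Orthogonal complement a of A (a'A = 0), taken from the full SVD of A.
    a = np.linalg.svd(A, full_matrices=True)[0][:, r:]
    C_y0 = B @ np.linalg.solve(a.T @ B, a.T @ y0)  # C(y0) = B (a'B)^{-1} a' y0
    L_state = np.linalg.cholesky(Lam)        # joint innovations [eta_t', nu_t']'
    L_phi = np.linalg.cholesky(Omega_Phi)    # random coefficient vec(Phi_t)
    L_u = np.linalg.cholesky(Omega_u)        # measurement error u_t
    y = np.empty((T, p))
    eps = np.empty((T, q))
    xi = np.empty((T, r))
    eps_prev = np.zeros(q)                   # eps_0 = 0
    xi_prev = np.asarray(xi0, dtype=float)
    for t in range(T):
        innov = L_state @ rng.standard_normal(p)
        eta, nu = innov[:q], innov[q:]
        vec_phi = Phi.reshape(-1, order="F") + L_phi @ rng.standard_normal(r * r)
        Phi_t = vec_phi.reshape((r, r), order="F")
        eps[t] = eps_prev + eta              # random walk component
        xi[t] = mu + Phi_t @ xi_prev + nu    # RCAR component, eq. (2)
        u_t = L_u @ rng.standard_normal(p)
        y[t] = C_y0 + B @ eps[t] + A @ xi[t] + u_t   # observation, eq. (1)
        eps_prev, xi_prev = eps[t], xi[t]
    return y, eps, xi
```

For instance, with *p* = 2 and *r* = 1 one might pass `A = np.array([[1.0], [0.0]])`, `B = np.array([[1.0], [1.0]])` and `rng = np.random.default_rng(0)`; the function then returns the observations together with the simulated random walk and RCAR paths.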

Define the parameter vectors,

$$
\omega := \left[ \begin{array}{ccc} \text{vec}(B)' & \text{vec}(A)' & \text{vec}(\Omega\_{u})' \end{array} \right]' \tag{5}
$$

$$
\lambda := \left[ \begin{array}{cccc} \mu' & \text{vec}(\Phi)' & \text{vec}(\Omega\_{\Phi})' & \text{vec}(\Lambda)' \end{array} \right]', \tag{6}
$$

which contain the parameters governing the observations *y<sub>t</sub>*, and unobserved components *ε<sub>t</sub>* and *ξ<sub>t</sub>*, respectively. The parameter vectors take values *ω* ∈ Θ<sub>*ω*</sub> and *λ* ∈ Θ<sub>*λ*</sub>, respectively. Additionally, we define the full parameter vector as

$$\theta := \left[ \begin{array}{cc} \omega' & \lambda' \end{array} \right]' \in \Theta\_{\omega} \times \Theta\_{\lambda} =: \Theta \,, \tag{7}$$

which indexes the model, and we refer to Θ as the parameter space. Note that *ω* and *λ* in *θ* are variation free in the sense of Engle et al. (1983). The parameter space is a subset of the *d<sub>θ</sub>*-dimensional Euclidean space, Θ ⊆ ℝ<sup>*d<sub>θ</sub>*</sup>, where *d<sub>θ</sub>* denotes the number of elements in *θ*. In the case where no restrictions are imposed on *θ*, the dimension *d<sub>θ</sub>* increases rapidly in *r* due to the ½(*r*<sup>2</sup> + 1)*r*<sup>2</sup> parameters in Ω<sub>Φ</sub>. We suggest restricting the off-diagonal elements of Ω<sub>Φ</sub> to zero to avoid over-parameterization. The number of parameters is then *d<sub>θ</sub>* = 2*p*<sup>2</sup> + *p* + 2*r*<sup>2</sup> + *r* when the model is otherwise unrestricted.
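
The parameter count can be checked by adding up the blocks of *θ*. The short Python sketch below does this under the assumption that each symmetric covariance matrix contributes only its distinct elements, which is the counting convention that reproduces the stated formula; the function name `ssr_param_count` and its flag are hypothetical.

```python
def ssr_param_count(p, r, diagonal_omega_phi=True):
    """Number of free parameters d_theta, counting symmetric matrices once."""
    n_loadings = p * (p - r) + p * r        # B and A
    n_omega_u = p * (p + 1) // 2            # symmetric Omega_u
    n_mu, n_phi = r, r * r                  # mu and Phi
    if diagonal_omega_phi:
        n_omega_phi = r * r                 # diagonal of the r^2 x r^2 matrix
    else:
        n_omega_phi = r * r * (r * r + 1) // 2
    n_lambda = p * (p + 1) // 2             # symmetric Lambda
    return n_loadings + n_omega_u + n_mu + n_phi + n_omega_phi + n_lambda
```

In the bivariate case with *p* = 2 and *r* = 1 the blocks sum to 13, in agreement with 2*p*<sup>2</sup> + *p* + 2*r*<sup>2</sup> + *r*.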

The log-likelihood function for any parameter vector *θ* ∈ Θ, fixed initial values *y*<sub>0</sub> ∈ ℝ<sup>*p*</sup>, *ε*<sub>0</sub> = 0<sub>*p*−*r*</sub> and *ξ*<sub>0</sub> ∈ ℝ<sup>*r*</sup>, and observation sequence *y*<sub>1:*T*</sub> ∈ ℝ<sup>*p*×*T*</sup> is given by,

$$\ell\_T(\theta) := \log p\_\theta(\varepsilon\_0, \xi\_0, y\_{0:T}) \,. \tag{8}$$

The sample score is given by the first derivative of (8),

$$S\_T(\theta) := \frac{\partial}{\partial \theta} \ell\_T(\theta) \,, \tag{9}$$

and the observed Information matrix is given by minus the second derivative of (8),

$$I\_T(\theta) := -\frac{\partial^2}{\partial \theta \partial \theta'} \ell\_T(\theta) \,. \tag{10}$$

Due to the nonlinear dynamics of the unobserved process (2), the log-likelihood function (8) and its derivatives (9)–(10) do not have closed-form expressions. In the following, we suppress the dependence on the initial values *ε*<sub>0</sub>, *ξ*<sub>0</sub> and *y*<sub>0</sub>, but note they remain fixed.

#### **3. Properties of the Process**

In this section we consider some properties of the process defined by Equations (1)–(3) for a given parameter value *θ* ∈ Θ. Specifically, we study the nonstationary and stationary components, including conditions on the parameter *θ* that ensure strict stationarity of the stationary component. Additionally, we decompose the observation *y<sub>t</sub>* into nonstationary and stationary directions.

#### *3.1. The Unobserved Components*

The first unobserved component of the model, *ε<sub>t</sub>*, is a *random walk* (RW) in *p* − *r* dimensions, equivalently expressed as an autoregressive process with a unit root. That is, for *t* = 1, . . . , *T*,

$$
\varepsilon\_t = \varepsilon\_{t-1} + \eta\_t \,, \tag{11}
$$

with *ε*<sub>0</sub> = 0<sub>*p*−*r*</sub>. The process (11) admits the transition density *p<sub>λ</sub>*(*ε<sub>t</sub>* | *ε*<sub>*t*−1</sub>) with respect to the (*p* − *r*)-dimensional Lebesgue measure; however, it does not have a stationary distribution. This type of process has been studied extensively, see e.g., Dickey and Fuller (1979). In summary, the RW process is linear and Gaussian, but nonstationary.

The second unobserved component of the model, *ξt*, is a *random coefficient autoregressive* (RCAR) process of lag order one in *r* dimensions. The RCAR process (2)–(3) is observationally equivalent to a double autoregressive (DAR) process with one lag, cf. Ling (2007), which we formalize in Lemma 1.

**Lemma 1.** *For θ* ∈ Θ*, the random coefficient autoregressive process (2)–(3) of lag order one has the following double autoregressive process representation, for t* = 1, 2, . . . , *T,*

$$\xi\_t = \mu + \Phi \xi\_{t-1} + \Omega\_{\nu, t}^{1/2} z\_t \tag{12}$$

$$
\Omega\_{\nu,t} = \Omega\_{\nu} + \left(\xi\_{t-1}^{\prime} \otimes I\_{r}\right) \Omega\_{\Phi} \left(\xi\_{t-1}^{\prime} \otimes I\_{r}\right)^{\prime},\tag{13}
$$

*for ξ*<sub>0</sub> *fixed, z<sub>t</sub>* ∼ *N*(0, *I<sub>r</sub>*)*, cross-covariance* ℂov[*η<sub>t</sub>*, *z<sub>t</sub>*] = Ω<sub>*η*,*ν*</sub>*, and with the joint innovation process* [*η*′<sub>*t*</sub>, *z*′<sub>*t*</sub>]′ *independent and identically distributed.*

The DAR representation in Lemma 1 of the RCAR process in (2)–(3) characterizes the process dynamics in terms of the conditional mean and variance. The conditional mean 𝔼<sub>*λ*</sub>[*ξ<sub>t</sub>* | *ξ*<sub>*t*−1</sub>] is autoregressive. However, the conditional variance 𝕍ar<sub>*λ*</sub>[*ξ<sub>t</sub>* | *ξ*<sub>*t*−1</sub>] depends positively on the lagged level 'squared'. The conditional variance is heteroskedastic, but not in the well-known ARCH sense of e.g., Engle (1982); rather, the lagged level of the process *ξ*<sub>*t*−1</sub> enters the variance, not the lagged innovation *ν*<sub>*t*−1</sub>. To illustrate the point, we consider for a moment the conditional variance in the univariate case *r* = 1, which is given by *ω*<sup>2</sup><sub>*ν*,*t*</sub> = *ω*<sup>2</sup><sub>*ν*</sub> + *ω*<sup>2</sup><sub>*φ*</sub>*ξ*<sup>2</sup><sub>*t*−1</sub>. Here we see that a relatively large (in absolute terms) lagged level |*ξ*<sub>*t*−1</sub>| will result in a relatively large volatility *ω*<sub>*ν*,*t*</sub> in the present period, and vice versa.
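
The conditional variance in (13) is easy to evaluate numerically. The sketch below is a minimal illustration, assuming the column-major vec convention so that Φ<sub>*t*</sub>*ξ*<sub>*t*−1</sub> = (*ξ*′<sub>*t*−1</sub> ⊗ *I<sub>r</sub>*) vec(Φ<sub>*t*</sub>); the function name `dar_conditional_variance` is hypothetical.

```python
import numpy as np

def dar_conditional_variance(xi_prev, Omega_nu, Omega_Phi):
    """Evaluate Omega_nu + (xi' kron I_r) Omega_Phi (xi' kron I_r)', cf. eq. (13)."""
    xi_prev = np.atleast_1d(np.asarray(xi_prev, dtype=float))
    r = xi_prev.size
    K = np.kron(xi_prev[None, :], np.eye(r))   # r x r^2 matrix (xi' kron I_r)
    return Omega_nu + K @ Omega_Phi @ K.T
```

In the univariate case the function reduces to the scalar expression above: with a lagged level of 2, an innovation variance of 0.5 and a coefficient variance of 0.1, it returns 0.5 + 0.1 · 2<sup>2</sup> = 0.9.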

We make the following assumption on the random coefficients (3) in order to ensure strict stationarity of the RCAR process (2)–(3).

**Assumption 1.** *Assume that the top Lyapunov exponent is strictly negative,*

$$\gamma := \lim\_{n \to \infty} \frac{1}{n} \mathbb{E}\_{\lambda} \left[ \log \left| \left| \prod\_{t=1}^{n} \Phi\_{t} \right| \right| \right] < 0 \,. \tag{14}$$

**Remark 1.** *The top Lyapunov exponent (14) is intractable but can be approximated to arbitrary precision via simulation, cf. inter alia Ling (2007) and Francq and Zakoian (2010). The following approximation converges almost surely*

$$\hat{\gamma}\_n := \frac{1}{n} \log \left\| \prod\_{t=1}^n \Phi\_t \right\| \quad \stackrel{a.s.}{\rightarrow} \gamma \,, \tag{15}$$

*as n* → ∞*. In turn, γ̂<sub>n</sub> can be computed efficiently via the QR-decomposition, cf. Dieci and Van Vleck (1995).*
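
As an illustration of Remark 1, the following Python sketch approximates the top Lyapunov exponent by simulating the random coefficients in (3) and accumulating the log growth rates from successive QR factorizations. It is a simplified sketch of the general QR idea and not necessarily the algorithm of Dieci and Van Vleck (1995); the function name `top_lyapunov_qr`, the default number of draws and the seed handling are assumptions.

```python
import numpy as np

def top_lyapunov_qr(Phi, Omega_Phi, n=20_000, seed=0):
    """Simulation estimate of the top Lyapunov exponent for vec(Phi_t) ~ N(vec(Phi), Omega_Phi)."""
    rng = np.random.default_rng(seed)
    r = Phi.shape[0]
    L_phi = np.linalg.cholesky(Omega_Phi)
    vec_mean = Phi.reshape(-1, order="F")
    Q = np.eye(r)
    log_growth = 0.0
    for _ in range(n):
        vec_phi = vec_mean + L_phi @ rng.standard_normal(r * r)
        Phi_t = vec_phi.reshape((r, r), order="F")
        # Re-orthogonalize at every step so the running product never
        # overflows or underflows; |R[0, 0]| is the growth of the first direction.
        Q, R = np.linalg.qr(Phi_t @ Q)
        log_growth += np.log(np.abs(R[0, 0]))
    return log_growth / n
```

A strictly negative return value for a candidate pair (Φ, Ω<sub>Φ</sub>) is numerical evidence, up to simulation error, that Assumption 1 holds at that parameter value.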

Assumption 1 ensures that the RCAR process can be characterized as a geometrically ergodic Markov chain, cf. Meyn and Tweedie (2005). This is formalized in the following theorem.

**Theorem 1** (Feigin and Tweedie (1985), Theorem 3)**.** *Under Assumption 1, the process* {*ξ<sub>t</sub>*}<sub>*t*=0,1,...</sub> *is geometrically ergodic. In particular, the initial value ξ*<sub>0</sub> *can be given an initial distribution p<sub>θ</sub>*(*ξ*<sub>0</sub>) *such that* {*ξ<sub>t</sub>*}<sub>*t*=0,1,...</sub> *is stationary and geometrically ergodic with some fractional moment.*

**Remark 2.** *The stationary component, ξ<sub>t</sub>, exhibits heavy-tailed behavior since it satisfies a stochastic recurrence equation. Pedersen and Wintenberger (2018) have recently considered the tail properties of processes of the form (2) for a more general specification of the random coefficient,* Φ<sub>*t*</sub>*, that includes BEKK-ARCH and DAR-type processes as special cases. It should be possible to show that the stationary distribution of ξ<sub>t</sub> as defined in (2)–(3) also has power-law tails under suitable conditions.*

The RCAR process (2)–(3) admits the transition density *p<sub>λ</sub>*(*ξ<sub>t</sub>* | *ξ*<sub>*t*−1</sub>) with respect to the *r*-dimensional Lebesgue measure. Moreover, the process has the stationary distribution *p<sub>θ</sub>*(*ξ<sub>t</sub>*) under Assumption 1. In summary, the RCAR process is Gaussian and strictly stationary, but nonlinear.

#### *3.2. The Observed Process*

The observations {*y<sub>t</sub>*}<sub>*t*=1,2,...</sub> are conditionally independent given the sequence of unobserved components {*ε<sub>t</sub>*, *ξ<sub>t</sub>*}<sub>*t*=1,2,...</sub>. Thus, the dynamics of the observed process are determined by the dynamics of the unobserved components.

We use the orthogonal complements *b* and *a* of the loading matrices *B* and *A*, respectively, and the skew-projection identity of Johansen (1996) to decompose the observation vector *y<sub>t</sub>* as follows,

$$y\_t = B\_a a' y\_t + A\_b b' y\_t \, , \tag{16}$$

where we define *B<sub>a</sub>* := *B*(*a*′*B*)<sup>−1</sup> and *A<sub>b</sub>* := *A*(*b*′*A*)<sup>−1</sup>. Here *a*′*B* and *b*′*A* are invertible thanks to our assumption that [*A B*] is square and invertible. By premultiplying *y<sub>t</sub>* by *a*′ we eliminate the stationary directions, while leaving the nonstationary directions,

$$a'y\_t = a'C(y\_0) + a'B\varepsilon\_t + a'u\_t\,. \tag{17}$$

What is left after the linear transformation (17) is a random walk with Gaussian measurement error. Similarly, premultiplying *y<sub>t</sub>* by *b*′ eliminates the nonstationary directions while the stationary directions remain,

$$
b^\prime y\_t = b^\prime A \xi\_t + b^\prime u\_t \,. \tag{18}
$$

The process given by (18) is a stationary random coefficient autoregressive process with Gaussian measurement error.
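
The decomposition (16) can be verified numerically for any admissible pair of loading matrices. The Python sketch below is a minimal illustration in which the orthogonal complements are obtained from a singular value decomposition; the helper names `orth_complement` and `decompose` are hypothetical, and any other basis of the complements gives the same two components.

```python
import numpy as np

def orth_complement(M):
    """Basis of the orthogonal complement of the column space of a full-column-rank M."""
    k = M.shape[1]
    return np.linalg.svd(M, full_matrices=True)[0][:, k:]

def decompose(y, A, B):
    """Split y into its nonstationary and stationary parts, cf. eq. (16)."""
    a, b = orth_complement(A), orth_complement(B)   # a'A = 0 and b'B = 0
    B_a = B @ np.linalg.inv(a.T @ B)                # B (a'B)^{-1}
    A_b = A @ np.linalg.inv(b.T @ A)                # A (b'A)^{-1}
    return B_a @ (a.T @ y), A_b @ (b.T @ y)
```

The two returned vectors sum to `y` whenever [*A B*] is invertible, which is the content of the skew-projection identity.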

The decomposition of the observation process (16) allows for a cointegration interpretation of the SSR model. The *p* observed variables in *y<sub>t</sub>* share *p* − *r* common stochastic trends (17) with loading matrix *B<sub>a</sub>*, while the *r* linear combinations (18) are stationary and load into the levels with the matrix *A<sub>b</sub>*. The observed process admits the conditional density *p<sub>θ</sub>*(*y<sub>t</sub>* | *y*<sub>1:*t*−1</sub>) with respect to the *p*-dimensional Lebesgue measure; however, this density does not have a closed-form expression. Moreover, the observed process does not have a stationary distribution.

#### **4. Likelihood-Based Estimation and Inference**

In this section, we introduce the ML estimator and consider its asymptotic properties. We wish to conduct estimation and inference based on the true, but intractable, model likelihood. Due to the intractability of the likelihood, we can neither compute the ML estimator via numerical optimization of (8), nor compute classic standard errors via the observed Information matrix (10). We refer to the ML estimator as being 'doubly intractable', with reference to the concept from the literature in Bayesian statistics on models with intractable likelihoods, see e.g., Murray et al. (2006). It is beyond the scope of this paper to derive a full asymptotic theory for the SSR model. Instead, we conjecture the limiting properties of the likelihood function (8) and its derivatives (9)–(10). We obtain the asymptotic properties for the ML estimator based on the conjecture.

We recall preliminarily that the ML estimator is defined as the parameter vector *θ* ∈ Θ that maximizes the log-likelihood function (8),

$$\hat{\theta}\_T := \operatorname\*{arg\,sup}\_{\theta \in \Theta} \ell\_T \left( \theta \right) \,, \tag{19}$$

noting that the ML estimator (19) is a function of the observation sequence *y*<sub>1:*T*</sub>. We denote by *θ*<sup>∗</sup> ∈ Θ the true parameter value for the data generating process (1)–(3). In the following, we make the below conjecture on the asymptotic properties of (8)–(10). Note that, having assumed that *B*<sup>∗</sup> is known, the score, information, and likelihood in the conjecture refer to the unknown parameters only; that is, all elements in *θ* excluding vec(*B*).

**Conjecture 1.** *If Assumption 1 holds, B*<sup>∗</sup> *is known, and θ*<sup>∗</sup> ∈ Θ ⊆ ℝ<sup>*d<sub>θ</sub>*</sup>*, then the log-likelihood function ℓ<sub>T</sub>*(·) : ℝ<sup>*d<sub>θ</sub>*</sup> → ℝ *is three times continuously differentiable in θ, and*


*where* 𝒩(*θ*<sup>∗</sup>) *is a neighborhood of θ*<sup>∗</sup> *and* 0 ≤ *c<sub>T</sub>* →<sup>*P*</sup> *c,* 0 < *c* < ∞*, as T* → ∞*.*

**Remark 3.** *Theorem 3 in Bohn Nielsen and Rahbek (2014) shows that Conjecture 1 holds in the case of the strictly stationary bivariate double autoregressive model with BEKK-type time-varying covariance. With B*<sup>∗</sup> *known, the SSR model corresponds closely to this model plus Gaussian measurement errors.*

It should be noted that we propose Conjecture 1 despite lack of finite moments of the RCAR process, cf. Theorem 1. This is in line with the results of *inter alia* Bohn Nielsen and Rahbek (2014) for the bivariate DAR model, and Ling (2004, 2007) for the univariate DAR model.

The result in Theorem 2 below states that if Conjecture 1 holds true, then the ML estimator (19) is unique, √*T*-consistent and asymptotically Gaussian. The result follows from applying Lemma 1 in Jensen and Rahbek (2004), the conditions of which correspond to (1.)–(3.) of Conjecture 1.

**Theorem 2** (Jensen and Rahbek (2004), Lemma 1)**.** *If Conjecture 1 holds, then there exists a fixed open neighborhood* 𝒰(*θ*<sup>∗</sup>) ⊆ 𝒩(*θ*<sup>∗</sup>) *of the true parameter θ*<sup>∗</sup>*, which is an interior point of* Θ*, such that with probability tending to one as T* → ∞*, there exists a maximum point θ̂<sub>T</sub> in* 𝒰(*θ*<sup>∗</sup>) *and ℓ<sub>T</sub>*(*θ*) *is concave in* 𝒰(*θ*<sup>∗</sup>)*. In particular, θ̂<sub>T</sub> is unique and satisfies the score equation*

$$S\_T(\hat{\theta}\_T) = 0.\tag{20}$$

*Additionally, the ML estimator is consistent, θ̂<sub>T</sub>* → *θ*<sup>∗</sup>*, and asymptotically Gaussian,*

$$\sqrt{T}(\hat{\theta}\_T-\theta^\*) \quad \xrightarrow{D} \quad N(0, \Omega\_I^{-1}\Omega\_S\Omega\_I^{-1}) , \quad T \to \infty \,. \tag{21}$$

**Proof.** The conditions (1.)–(3.) of Conjecture 1 are the Cramér-type conditions of Lemma 1 in Jensen and Rahbek (2004), which provides the result.

We assume that the true value of *B* is known, because Chang et al. (2009) showed that the ML estimator of the loading matrix *B* exhibits *T*-convergence and is asymptotically mixed Gaussian in the CST model. The CST model corresponds to the SSR model with *p* − *r* = 1, but without the stationary components, i.e., *A* = 0<sub>*p*×*r*</sub> for any *p*. We find it reasonable to believe that this result carries over to the SSR model. Moreover, fixing *B* is conceptually similar to classic cointegration analysis with known cointegrating vectors, which is an accepted starting point for new methodological developments, see e.g., Bec and Rahbek (2004). In applications we often have a predefined set of cointegrating vectors that we are interested in. In the context of the SSR model, the cointegrating vectors correspond to the rows of *b*′, the transpose of the orthogonal complement. As an example, for the empirical illustration in Section 10 we consider an interest rate spread in a bivariate system with one common stochastic trend, i.e., *p* = 2 and *p* − *r* = 1. The spread implies *b* = [ 1 −1 ]′, which in turn corresponds to the loading matrix *B* = [ 1 1 ]′ when normalizing on the first element.

The Fisher Information matrix, Ω<sub>*I*</sub>, is consistently estimated by the (scaled) observed Information matrix evaluated at *θ̂<sub>T</sub>*, cf. Conjecture 1.(3.). Moreover, the asymptotic variance of the score, Ω<sub>*S*</sub>, is equal to the Fisher Information matrix when the model is well-specified, i.e., the information matrix equality holds, cf. e.g., Hamilton (1994, sct. 14.4). In this case, the asymptotic variance of the ML estimator (19) is simply the inverse Fisher Information matrix. Thus, we can use classic standard errors, which are based on the observed Information matrix (10), to conduct inference on the ML estimates.
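
As a hedged illustration of how such standard errors could be computed once an (approximate) observed Information matrix is available, consider the Python sketch below. The function names are hypothetical. The second function estimates the variance of the score by the sum of outer products of per-period score contributions, which is an additional assumption that is not made in the text and is included only to show the sandwich form in (21).

```python
import numpy as np

def classic_standard_errors(obs_info):
    """Standard errors from the inverse observed Information matrix at theta_hat."""
    return np.sqrt(np.diag(np.linalg.inv(obs_info)))

def sandwich_standard_errors(obs_info, score_contributions):
    """Standard errors from I_T^{-1} (sum_t s_t s_t') I_T^{-1}, cf. the variance in (21).

    score_contributions is a (T, d_theta) array of per-period scores; using
    their outer product as the score variance estimate is an assumption.
    """
    info_inv = np.linalg.inv(obs_info)
    outer = score_contributions.T @ score_contributions   # sum of s_t s_t'
    return np.sqrt(np.diag(info_inv @ outer @ info_inv))
```

When the information matrix equality holds, the two sets of standard errors should be close in large samples, so a marked difference between them can be read as informal evidence of misspecification.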

#### **5. The Incomplete Data Framework**

In this section, we appeal to the incomplete data framework of Dempster et al. (1977) to deal with the unobserved components of the SSR model. First, we formulate the state space representation of the model in (1)–(3) and its associated *optimal filtering problem*. Second, we formulate the intractable sample score (9) and observed Information matrix (10) in terms of the optimal filtering problem. In Section 6 we introduce a particle filter algorithm with which we can approximate the optimal filtering problem. This enables approximation of the intractable sample score and observed Information matrix via the particle filter algorithm.

#### *5.1. The State Space Form and the Optimal Filtering Problem*

Preliminarily, we collect the unobserved components in the vector *x<sub>t</sub>* := [ *ε*′<sub>*t*</sub> *ξ*′<sub>*t*</sub> ]′, which we refer to as the *state vector*. The unobserved components are Markov, see (11)–(13), and the observation depends only on the contemporaneous values of the unobserved components. Thus, the SSR model in (1)–(3) has the dependency structure of a state space model. Formally, for *t* = 1, ... , *T*, the SSR model in (1)–(3) has the following state space representation,

$$y\_t = C(y\_0) + \Pi x\_t + \Omega\_u^{1/2} u\_t \tag{22}$$

$$x\_t = \alpha + \Gamma x\_{t-1} + \Lambda\_t^{1/2} v\_t \,, \tag{23}$$

with *y*<sub>0</sub> and *x*<sub>0</sub> fixed, *u<sub>t</sub>* ∼ *N*(0, *I<sub>p</sub>*) and *v<sub>t</sub>* ∼ *N*(0, *I<sub>p</sub>*), and *u<sub>t</sub>* and *v<sub>t</sub>* mutually independent. We define accordingly,

$$\Pi := \begin{bmatrix} B' \\ A' \end{bmatrix}', \quad \alpha := \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \quad \Gamma := \begin{bmatrix} I\_{p-r} & 0 \\ 0 & \Phi \end{bmatrix} \quad \text{and} \quad \Lambda\_t := \begin{bmatrix} \Omega\_{\eta} & \Omega\_{\eta,\nu} \\ \Omega\_{\eta,\nu}' & \Omega\_{\nu,t} \end{bmatrix}, \tag{24}$$

and recall that Ω<sub>*ν*,*t*</sub> is defined in Lemma 1. We refer to (22) as the *observation equation*, and to (23) as the *transition equation*. It is easy to verify that the state space representation in (22) and (23) is observationally equivalent to the SSR model as presented in (1)–(3). The observation and transition equations admit the densities with respect to the *p*-dimensional Lebesgue measure,

$$p\_{\omega} \left( y\_t \mid x\_t \right) = N(y\_t; C(y\_0) + \Pi x\_t, \Omega\_u) \tag{25}$$

$$p\_{\lambda} \left( x\_{t} \mid x\_{t-1} \right) = N(x\_{t}; \alpha + \Gamma x\_{t-1}, \Lambda\_t) \,, \tag{26}$$

respectively. We refer to (25) as the *observation density* and to (26) as the *transition density*. As mentioned previously, we suppress the dependence on the initial observation *y*<sub>0</sub>.

One approach to conducting inference on the unobserved components, i.e., the state vector *x<sub>t</sub>*, is the optimal filtering problem, cf. Anderson and Moore (1979). The optimal filtering problem refers to the general problem of computing the conditional expectation of some sequence of unobserved states given some sequence of observations. In the following, we consider the specific instance of the optimal filtering problem known as the *smoothing problem*. Formally, the smoothing problem is a conditional expectation of the form,

$$\mathbb{E}\_{\theta} \left[ \gamma\_{t}(\mathbf{x}\_{1:t}) \mid y\_{1:t} \right] = \int \gamma\_{t}(\mathbf{x}\_{1:t}) \, p\_{\theta} \left( \mathbf{x}\_{1:t} \mid y\_{1:t} \right) \, \mathrm{d}\mathbf{x}\_{1:t} \, \tag{27}$$

for any function *<sup>γ</sup>t*(*x*1:*t*) <sup>∈</sup> *<sup>L</sup>*<sup>1</sup> <sup>R</sup>*t p*, *<sup>p</sup><sup>θ</sup>* (*x*1:*<sup>t</sup>* <sup>|</sup> *<sup>y</sup>*1:*t*) and point in time *t* ∈ {1, ... , *T*}. We refer to the function *<sup>γ</sup>t*(*x*1:*t*) as the *test function* and to the density *<sup>p</sup><sup>θ</sup>* (*x*1:*<sup>t</sup>* <sup>|</sup> *<sup>y</sup>*1:*t*) as the *smoothing density*. The test function may be time-varying, but of known form for a fixed observation sequence *y*1:*T*. The smoothing density in (27) can be expressed as the recursion of the lagged smoothing density,

$$p\_{\boldsymbol{\theta}}\left(\mathbf{x}\_{1:t} \mid y\_{1:t}\right) = \frac{p\_{\boldsymbol{\omega}}\left(y\_t \mid \mathbf{x}\_t\right) p\_{\boldsymbol{\lambda}}\left(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}\right)}{p\_{\boldsymbol{\theta}}\left(y\_t \mid y\_{1:t-1}\right)} p\_{\boldsymbol{\theta}}\left(\mathbf{x}\_{1:t-1} \mid y\_{1:t-1}\right) \tag{28}$$

initialized with $p_\theta(x_1 \mid x_0, y_0, y_1)$. The normalizing constant in (28) is the likelihood contribution, which is given by the integral,

$$p\_{\boldsymbol{\theta}}\left(\boldsymbol{y}\_{t}\mid\boldsymbol{y}\_{1:t-1}\right) = \int p\_{\boldsymbol{\omega}}\left(\boldsymbol{y}\_{t}\mid\boldsymbol{\mathbf{x}}\_{t}\right) p\_{\boldsymbol{\lambda}}\left(\boldsymbol{x}\_{t}\mid\boldsymbol{x}\_{t-1}\right) p\_{\boldsymbol{\theta}}\left(\boldsymbol{x}\_{1:t-1}\mid\boldsymbol{y}\_{1:t-1}\right) \,\mathrm{d}\mathbf{x}\_{1:t} \,. \tag{29}$$

We note the smoothing density recursion (28) is intractable due to the intractability of the likelihood contribution (29). In the following, we will use the smoothing problem (27) to address computation of the sample score (9) and observed Information matrix (10).

#### *5.2. The Sample Score and Observed Information as Smoothing Problems*

The incomplete data framework is closely associated with the classic expectation maximization (EM) algorithm, introduced in Dempster et al. (1977). The EM algorithm is a common approach to maximizing the log-likelihood function (8) to obtain the ML estimator (19) for models with unobserved variables. When the EM algorithm is applicable, it is also possible to evaluate the sample score (9) and observed Information matrix (10). For the SSR model, however, the EM algorithm does not apply directly, yet we may use the incomplete data framework to reformulate the sample score and observed Information in terms of intractable smoothing problems of the form (27).

A central concept of the EM algorithm is the auxiliary function called the *intermediate quantity*, which is defined as,

$$\begin{aligned} \mathcal{Q}_T(\theta \mid \vartheta) &:= \int \log p_\theta(y_{1:T}, \mathbf{x}_{1:T}) \, p_\vartheta(\mathbf{x}_{1:T} \mid y_{1:T}) \, \mathrm{d}\mathbf{x}_{1:T} \\ &= \ell_T(\theta) - H_T(\theta \mid \vartheta), \end{aligned} \tag{30}$$

where

$$H_T(\theta \mid \vartheta) := -\int \log p_\theta(\mathbf{x}_{1:T} \mid y_{1:T}) \, p_\vartheta(\mathbf{x}_{1:T} \mid y_{1:T}) \, \mathrm{d}\mathbf{x}_{1:T}, \tag{31}$$

for any parameter values *θ*, *ϑ* ∈ Θ. We refer to log *p<sup>θ</sup>* (*y*1:*T*, *x*1:*T*) as the *complete data log-likelihood*. By the state space model structure (22)–(23) and variation freeness of *θ* defined in (7), we have that the complete data log-likelihood is given by,

$$\log p_\theta(y_{1:T}, \mathbf{x}_{1:T}) = \sum_{t=1}^{T} \left[ \log p_\omega(y_t \mid \mathbf{x}_t) + \log p_\lambda(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \right]. \tag{32}$$

The intermediate quantity (30) is sometimes also called the expected log-likelihood, since it is interpretable as the conditional expectation of the complete data log-likelihood (32) given the observations *y*1:*T*. We note the term separating the log-likelihood (8) and the intermediate quantity (30) is the entropy of the smoothing density (28) with parameters *ϑ* and *θ*, defined in (31).

We are interested in the intermediate quantity (30) because it provides a convenient way to derive the sample score and observed Information matrix in terms of the derivatives of the complete data log-likelihood (32). The first and second derivatives of the complete data log-likelihood function in (32) are the sum of the first and second order derivatives of the observation and transition log-densities with respect to *ω* and *λ*, respectively. These can be computed by either analytical or numerical differentiation of (32). For *ϑ* ∈ Θ, we define the derivatives of (32) in terms of the functions,

$$U_T(\mathbf{x}_{1:T}; \vartheta) := \frac{\partial}{\partial\theta} \log p_\theta(y_{1:T}, \mathbf{x}_{1:T}) \Big|_{\theta=\vartheta} = \sum_{t=1}^{T} u_t(\mathbf{x}_t, \mathbf{x}_{t-1}; \vartheta) \tag{33}$$

$$V_T(\mathbf{x}_{1:T}; \vartheta) := \frac{\partial^2}{\partial\theta \partial\theta'} \log p_\theta(y_{1:T}, \mathbf{x}_{1:T}) \Big|_{\theta=\vartheta} = \sum_{t=1}^{T} v_t(\mathbf{x}_t, \mathbf{x}_{t-1}; \vartheta), \tag{34}$$

where, taking advantage of the variation freeness of the model parameter, *θ*, we define the summands of (33) and (34), respectively, as

$$u_t(\mathbf{x}_t, \mathbf{x}_{t-1}; \vartheta) := \begin{bmatrix} \frac{\partial}{\partial \omega} \log p_\omega(y_t \mid \mathbf{x}_t) \\ \frac{\partial}{\partial \lambda} \log p_\lambda(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \end{bmatrix} \Bigg|_{\theta = \vartheta} \tag{35}$$

and

$$v_t(\mathbf{x}_t, \mathbf{x}_{t-1}; \vartheta) := \begin{bmatrix} \frac{\partial^2}{\partial \omega \partial \omega'} \log p_\omega(y_t \mid \mathbf{x}_t) & 0_{d_\omega \times d_\lambda} \\ 0_{d_\lambda \times d_\omega} & \frac{\partial^2}{\partial \lambda \partial \lambda'} \log p_\lambda(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \end{bmatrix} \Bigg|_{\theta = \vartheta}. \tag{36}$$

We note that the functions (35) and (36) should not be confused with the measurement error in (22) and innovations in (23), respectively.
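To illustrate the numerical route to the summands, the following sketch compares a central finite-difference gradient with the analytical gradient for a hypothetical scalar Gaussian transition density $N(x_t; a + g x_{t-1}, s^2)$ with $\lambda = (a, g)$. The model, the function names, and the step size `eps` are illustrative assumptions and are not the SSR specification.

```python
import numpy as np

def log_transition(lam, x_t, x_prev, s2=1.0):
    """Log-density of N(x_t; a + g * x_prev, s2) for lam = (a, g) (toy model)."""
    a, g = lam
    resid = x_t - a - g * x_prev
    return -0.5 * (np.log(2.0 * np.pi * s2) + resid ** 2 / s2)

def numerical_gradient(f, lam, eps=1e-6):
    """Central finite-difference gradient of the scalar function f at lam."""
    lam = np.asarray(lam, dtype=float)
    grad = np.zeros_like(lam)
    for k in range(lam.size):
        step = np.zeros_like(lam)
        step[k] = eps
        grad[k] = (f(lam + step) - f(lam - step)) / (2.0 * eps)
    return grad

def analytic_gradient(lam, x_t, x_prev, s2=1.0):
    """Closed-form gradient of log_transition with respect to (a, g)."""
    a, g = lam
    resid = x_t - a - g * x_prev
    return np.array([resid / s2, resid * x_prev / s2])

lam0 = np.array([0.1, 0.8])
x_t, x_prev = 1.5, 1.0
num = numerical_gradient(lambda l: log_transition(l, x_t, x_prev), lam0)
ana = analytic_gradient(lam0, x_t, x_prev)
```

In this toy case the two gradients agree up to rounding error, which is the kind of check one might run before relying on numerical derivatives for a more involved density.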

If the first and second order derivatives of the complete data log-likelihood in (33) and (34), respectively, are integrable with respect to the smoothing density (28), then we may appeal to Fisher's and Louis' identities (defined below) to express the sample score (9) and observed Information matrix (10) in terms of smoothing problems of the form (27).

**Conjecture 2.** *For any $\theta \in \Theta$ and observation sequence $y_{1:T} \in \mathbb{R}^{p \times T}$, it holds that $U_T(x_{1:T}; \theta) \in L^2[\mathbb{R}^{p \times T}, p_\theta(x_{1:T} \mid y_{1:T})]$ and $V_T(x_{1:T}; \theta) \in L^1[\mathbb{R}^{p \times T}, p_\theta(x_{1:T} \mid y_{1:T})]$.*

For the same reasons that we conjectured the asymptotic properties of the true log-likelihood function, sample score, and observed Information matrix, we conjecture integrability of the derivatives of the complete data log-likelihood (33) and (34).

Fisher's identity, cf. Dempster et al. (1977), states the first derivative of the intermediate quantity (30) is equivalent to the sample score (9). Similarly, Louis' identity of Louis (1982) establishes a relation between the first and second derivatives of the intermediate quantity (30) and the observed Information matrix (10).

**Lemma 2** (Fisher's and Louis' identities, cf. Cappé et al. (2005), Proposition 10.1.6)**.** *If Conjecture 2 holds and θ* ∈ Θ*, then the sample score (9) is equivalently given by*

$$S_T(\theta) = \int U_T(\mathbf{x}_{1:T}; \theta) \, p_\theta(\mathbf{x}_{1:T} \mid y_{1:T}) \, \mathrm{d}\mathbf{x}_{1:T}, \tag{37}$$

*and the observed Information (10) is equivalently given by*

$$I_T(\theta) = S_T(\theta) S_T(\theta)' - G_T(\theta) - K_T(\theta), \tag{38}$$

*where*

$$G_T(\theta) := \int V_T(\mathbf{x}_{1:T}; \theta) \, p_\theta(\mathbf{x}_{1:T} \mid y_{1:T}) \, \mathrm{d}\mathbf{x}_{1:T} \tag{39}$$

$$K_T(\theta) := \int U_T(\mathbf{x}_{1:T}; \theta) U_T(\mathbf{x}_{1:T}; \theta)' \, p_\theta(\mathbf{x}_{1:T} \mid y_{1:T}) \, \mathrm{d}\mathbf{x}_{1:T}, \tag{40}$$

*and the functions UT*(*x*1:*T*; *θ*) *and VT*(*x*1:*T*; *θ*) *are defined in (33) and (34), respectively.*

Although Lemma 2 shows the sample score (9) and observed Information (10) can be restated as smoothing problems of the form (27), we still cannot obtain closed-form expressions due to the intractability of the optimal filtering problem, cf. Section 5.1. In the next section, we introduce a particle filter algorithm that can approximate smoothing problems for appropriately chosen test functions, such as the functions *UT*(*x*1:*T*; *θ*) and *VT*(*x*1:*T*; *θ*) under Conjecture 2.
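As a worked illustration of the two identities, consider a hypothetical one-period toy model (not the SSR model) with a scalar state $x \sim N(\theta, 1)$ and observation $y \mid x \sim N(x, 1)$. Here everything is available in closed form, so the identities can be checked by hand. The smoothing density is $x \mid y \sim N((y+\theta)/2, 1/2)$, and the complete data derivatives are $U = x - \theta$ and $V = -1$.

$$\begin{aligned}
S(\theta) &= \mathbb{E}_\theta[x - \theta \mid y] = \tfrac{1}{2}(y - \theta), \\
G(\theta) &= \mathbb{E}_\theta[-1 \mid y] = -1, \\
K(\theta) &= \mathbb{E}_\theta[(x - \theta)^2 \mid y] = \tfrac{1}{2} + \tfrac{1}{4}(y - \theta)^2, \\
I(\theta) &= S(\theta)^2 - G(\theta) - K(\theta) = \tfrac{1}{4}(y-\theta)^2 + 1 - \tfrac{1}{2} - \tfrac{1}{4}(y-\theta)^2 = \tfrac{1}{2}.
\end{aligned}$$

These agree with direct differentiation of the marginal log-likelihood, since $y \sim N(\theta, 2)$ gives $\ell(\theta) = -\tfrac{1}{4}(y-\theta)^2 + \text{const}$, hence $\partial \ell / \partial \theta = \tfrac{1}{2}(y-\theta)$ and $-\partial^2 \ell / \partial \theta^2 = \tfrac{1}{2}$.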

#### **6. Particle Filter-Based Approximations**

In this section, we introduce a particle filter algorithm that produces pointwise approximations to the true but intractable log-likelihood function (8), sample score (9), and observed Information matrix (10) for any parameter $\theta \in \Theta$ and fixed observation sequence $y_{1:T} \in \mathbb{R}^{p \times T}$. In Section 7, we show how to apply the particle filter-based approximations introduced in this section to approximate the true, intractable ML estimator and classic standard errors, which we introduced in Section 4.

#### *6.1. Particle Filtering*

A particle filter is a simulation-based algorithm that produces approximations to smoothing problems of the form (27) for state space models. We introduce here a standard particle filter, which produces empirical measures that recursively approximate the smoothing density (28) for each time point in the observed sample *t* ∈ {1, ... , *T*}. The empirical measures consist of point masses, which we refer to as *particles*, and we use these for Monte Carlo integration in order to approximate the smoothing problem (27). Additionally, the particle filter produces a point-wise approximation of the log-likelihood function as a by-product. For an introduction to particle filtering in the context of economics and finance see Creal (2012).

The particle filter algorithm relies on an *importance density*, denoted $q_\theta(x_{1:t} \mid y_{1:t})$, that has the same support and recursive structure as the smoothing density (28). Formally, for $t = 1, \ldots, T$, we define the importance density as,

$$q\_{\theta} \left( \mathbf{x}\_{1:t} \mid y\_{1:t} \right) := q\_{\theta} \left( \mathbf{x}\_{t} \mid \mathbf{x}\_{t-1}, y\_{t} \right) q\_{\theta} \left( \mathbf{x}\_{1:t-1} \mid y\_{1:t-1} \right) \,, \tag{41}$$

initialized by $q_\theta(x_1 \mid x_0, y_0, y_1)$. We note the importance density (41) is defined recursively by $q_\theta(x_t \mid x_{t-1}, y_t)$, which we refer to as the *importance transition density*.

Assuming the smoothing density (28) is absolutely continuous with respect to the importance density (41), we can write the former as the product of the importance density and a weight function,

$$p\_{\theta} \left( \mathbf{x}\_{1:t} \mid y\_{1:t} \right) = \overline{w}\_{t} \left( \mathbf{x}\_{1:t} \right) q\_{\theta} \left( \mathbf{x}\_{1:t} \mid y\_{1:t} \right), \quad \overline{w}\_{t} \left( \mathbf{x}\_{1:t} \right) := \frac{p\_{\theta} \left( \mathbf{x}\_{1:t} \mid y\_{1:t} \right)}{q\_{\theta} \left( \mathbf{x}\_{1:t} \mid y\_{1:t} \right)}. \tag{42}$$

We refer to the weight function $\bar{w}_t(x_{1:t})$ as the *normalized importance weight*. We note that (42) constitutes a change of measure from the smoothing density to the importance density, and the normalized importance weight is a Radon-Nikodym derivative between the two densities.
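The change of measure in (42) can be illustrated with a small importance sampling sketch. The target $N(1, 1)$ and the wider proposal $N(0, 2^2)$ are hypothetical stand-ins for the smoothing and importance densities, and the weights are self-normalized, as one would do when the normalizing constant of the target is unknown.

```python
import numpy as np

def log_normal_pdf(x, mean, sd):
    """Log-density of a univariate normal distribution."""
    return -0.5 * np.log(2.0 * np.pi * sd ** 2) - 0.5 * ((x - mean) / sd) ** 2

rng = np.random.default_rng(1)
N = 100_000
x = rng.normal(0.0, 2.0, size=N)                 # draws from the proposal q

# Log of the density ratio p/q evaluated at each draw.
log_w = log_normal_pdf(x, 1.0, 1.0) - log_normal_pdf(x, 0.0, 2.0)
w = np.exp(log_w - log_w.max())                  # subtract max for stability
w_bar = w / w.sum()                              # self-normalized weights

estimate = np.sum(w_bar * x)                     # approximates E_p[x] = 1
```

The proposal is chosen with heavier spread than the target so that the weights have finite variance; a proposal that is too narrow would make the weighted average unreliable.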

Substituting the recursive expressions for the smoothing density (28) and importance density (41) into the expression for the normalized importance weight in (42), we obtain a recursive expression for the normalized importance weight,

$$
\bar{w}_t(\mathbf{x}_{1:t}) = \frac{\tilde{w}_t(\mathbf{x}_{t-1:t})}{p_\theta(y_t \mid y_{1:t-1})} \, \bar{w}_{t-1}(\mathbf{x}_{1:t-1}) \tag{43}
$$

where we define

$$
\tilde{w}\_t\left(\mathbf{x}\_{t-1:t}\right) := \frac{p\_{\omega}\left(y\_t \mid \mathbf{x}\_t\right) p\_\lambda\left(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}\right)}{q\_\theta\left(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}, y\_t\right)}.\tag{44}$$

We refer to (44) as the *incremental importance weights*. The recursion for the normalized importance weight (43) is normalized by the likelihood contribution (29) and is therefore also intractable.
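For completeness, the substitution step can be written out as follows, using only the recursions (28) and (41) and the definitions in (42) and (44):

$$\begin{aligned}
\bar{w}_t(\mathbf{x}_{1:t})
&= \frac{p_\theta(\mathbf{x}_{1:t} \mid y_{1:t})}{q_\theta(\mathbf{x}_{1:t} \mid y_{1:t})} \\
&= \frac{p_\omega(y_t \mid \mathbf{x}_t) \, p_\lambda(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{p_\theta(y_t \mid y_{1:t-1}) \, q_\theta(\mathbf{x}_t \mid \mathbf{x}_{t-1}, y_t)} \cdot \frac{p_\theta(\mathbf{x}_{1:t-1} \mid y_{1:t-1})}{q_\theta(\mathbf{x}_{1:t-1} \mid y_{1:t-1})} \\
&= \frac{\tilde{w}_t(\mathbf{x}_{t-1:t})}{p_\theta(y_t \mid y_{1:t-1})} \, \bar{w}_{t-1}(\mathbf{x}_{1:t-1}).
\end{aligned}$$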

For particle filtering in general, the importance transition density is subject to choice under mild regularity conditions, cf. e.g., Assumption 9.4.1 in Cappé et al. (2005). We let the importance transition density be the corresponding model density; formally,

$$q_\theta(\mathbf{x}_t \mid \mathbf{x}_{t-1}, y_t) := p_\theta(\mathbf{x}_t \mid \mathbf{x}_{t-1}, y_t). \tag{45}$$

We refer to (45) as the *locally optimal transition density*. This choice of importance transition density is optimal in the sense that it is conditional on the contemporary observation $y_t$, cf. Doucet et al. (2000). This is sometimes also referred to as 'fully adapted', cf. e.g., Pitt and Shephard (1999b). If we instead let the importance transition density be the model transition density (26), we omit the information about $x_t$ that is contained in $y_t$. The locally optimal transition density is not necessarily available in closed form for nonlinear state space models. It is, however, available for the SSR model and we present it in Lemma 3.

**Lemma 3.** *For θ* ∈ Θ*, the locally optimal transition density has the closed-form expression*

$$p_\theta(\mathbf{x}_t \mid \mathbf{x}_{t-1}, y_t) = N(\mathbf{x}_t; \mu_{t|t}^{x}, \Sigma_{t|t}^{x}), \tag{46}$$

*where the conditional mean and variance are given by,*

$$
\mu\_{t|t}^x = \mu\_{t|t-1}^x + \Sigma\_{t|t-1}^x \Pi^\prime \left[\Sigma\_{t|t-1}^y\right]^{-1} \left(y\_t - \mu\_{t|t-1}^y\right) \tag{47}
$$

$$
\Sigma_{t|t}^{x} = \Sigma_{t|t-1}^{x} - \Sigma_{t|t-1}^{x} \Pi' \left[ \Sigma_{t|t-1}^{y} \right]^{-1} \Pi \Sigma_{t|t-1}^{x} \tag{48}
$$

*with*

$$
\mu\_{t|t-1}^y = \mathcal{C}(y\_0) + \Pi \mu\_{t|t-1}^x \tag{49}
$$

$$
\Sigma\_{t|t-1}^y = \Pi \Sigma\_{t|t-1}^x \Pi' + \Omega\_u \tag{50}
$$

$$
\mu\_{t|t-1}^x = a + \Gamma \mathbf{x}\_{t-1} \tag{51}
$$

$$
\Sigma\_{t|t-1}^x = \Lambda\_{t|t-1} \tag{52}
$$

*and the state space form definitions given in (24).*

**Remark 4.** *The locally optimal transition density (46) is related to the Kalman (1960) filter, which solves the optimal filtering problem analytically for linear and Gaussian models. Equations (49)–(52) correspond to the Kalman filter for a known value of $x_{t-1}$. Related methods for efficient particle filtering include the mixture Kalman filter and Rao-Blackwellisation, cf. Chen and Liu (2000) and Andrieu and Doucet (2002).*
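The moments in (47)–(52) amount to a single Kalman-type update for a known $x_{t-1}$, which can be sketched as below. The function name and argument names are hypothetical; in particular, `c` stands in for the term $C(y_0)$ in (49) and `Lambda` for the conditional state covariance in (52), both of which are simply passed in by the caller here rather than derived from the SSR model.

```python
import numpy as np

def locally_optimal_moments(x_prev, y_t, a, Gamma, Lambda, Pi, Omega_u, c):
    """Moments of p(x_t | x_{t-1}, y_t) and p(y_t | x_{t-1}) for a known x_{t-1}."""
    mu_x_pred = a + Gamma @ x_prev                      # (51)
    Sigma_x_pred = Lambda                               # (52)
    mu_y_pred = c + Pi @ mu_x_pred                      # (49)
    Sigma_y_pred = Pi @ Sigma_x_pred @ Pi.T + Omega_u   # (50)
    # Gain = Sigma_x Pi' [Sigma_y]^{-1}, computed with a linear solve
    # instead of an explicit matrix inverse.
    gain = np.linalg.solve(Sigma_y_pred, Pi @ Sigma_x_pred).T
    mu_x_post = mu_x_pred + gain @ (y_t - mu_y_pred)            # (47)
    Sigma_x_post = Sigma_x_pred - gain @ Pi @ Sigma_x_pred      # (48)
    return mu_x_post, Sigma_x_post, mu_y_pred, Sigma_y_pred

# Scalar toy inputs written as 1x1 arrays (illustrative values only).
mu_post, Sigma_post, mu_y, Sigma_y = locally_optimal_moments(
    x_prev=np.array([2.0]), y_t=np.array([3.0]),
    a=np.array([0.0]), Gamma=np.array([[0.5]]), Lambda=np.array([[1.0]]),
    Pi=np.array([[1.0]]), Omega_u=np.array([[1.0]]), c=np.array([0.0]),
)
```

With these toy inputs the predicted state mean is 1, the predictive observation variance is 2, and the gain is 0.5, so the update moves the state mean halfway towards the observation.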

It is straightforward to use the general expression for the incremental importance weight in (44) to show that letting the importance transition density be the locally optimal transition density, i.e., (45), results in the following specific expression for incremental importance weights,

$$
\tilde{w}_t(\mathbf{x}_{t-1}) = p_\theta(y_t \mid \mathbf{x}_{t-1}). \tag{53}
$$

We refer to the density in (53) as the *predictive observation density*. It has a closed-form expression that follows from the closed-form expression of the locally optimal transition density in Lemma 3.
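The step from (44) to (53) can be spelled out with Bayes' rule, using that $y_t$ depends on $\mathbf{x}_{t-1}$ only through $\mathbf{x}_t$ in the state space form:

$$\begin{aligned}
p_\theta(\mathbf{x}_t \mid \mathbf{x}_{t-1}, y_t) &= \frac{p_\omega(y_t \mid \mathbf{x}_t) \, p_\lambda(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{p_\theta(y_t \mid \mathbf{x}_{t-1})}, \qquad p_\theta(y_t \mid \mathbf{x}_{t-1}) = \int p_\omega(y_t \mid \mathbf{x}_t) \, p_\lambda(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \, \mathrm{d}\mathbf{x}_t, \\
\frac{p_\omega(y_t \mid \mathbf{x}_t) \, p_\lambda(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_t \mid \mathbf{x}_{t-1}, y_t)} &= p_\theta(y_t \mid \mathbf{x}_{t-1}).
\end{aligned}$$

The right-hand side does not depend on $\mathbf{x}_t$, which is why the incremental weight is a function of $\mathbf{x}_{t-1}$ alone.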

**Corollary 1.** *For θ* ∈ Θ*, the predictive observation density has the closed-form expression*

$$p\_{\theta} \left( y\_t \mid \mathbf{x}\_{t-1} \right) = N(y\_t; \boldsymbol{\mu}\_{t|t-1}^y, \boldsymbol{\Sigma}\_{t|t-1}^y) \,, \tag{54}$$

*recalling the definitions in (49)–(52).*

**Proof.** Contained in the proof of Lemma 3.

**Remark 5.** *The choice of importance transition density (45) is locally optimal in the sense that the conditional variance of the incremental importance weights (53) given xt*−<sup>1</sup> *is zero, cf. Doucet et al. (2000).*

The particle filter, presented in Algorithm 1 below, produces weighted particle samples approximately distributed as the smoothing density (28) at each point in time $t = 1, \ldots, T$. The algorithm consists of iterating over three steps. At point $t$ in time, the first step is to sample $N$ particles, denoted $\{\tilde{x}_{1:t}^{(i)}\}_{i=1}^{N}$, from the importance density (41) given the particle sample from $t - 1$. This is called the *propagation step*. Step two consists of computing self-normalized importance weights, denoted $\{\bar{w}_t^{(i)}\}_{i=1}^{N}$, that approximate the normalized importance weights (43). This is the *weighting step*. The third step is to sample $N$ particle indices, denoted $\{I^{(i)}\}_{i=1}^{N}$, with replacement. We sample index $j$ with probability $\bar{w}_t^{(j)}$ for $j \in \{1, \ldots, N\}$. We retain the number of particles indicated by the resulting sample of particle indices, denoted $\{x_{1:t}^{(i)}\}_{i=1}^{N}$, and let the importance weights be uniform. This is the *resampling step*. After resampling, we store the particle samples and proceed to $t + 1$.

For a fixed parameter value $\theta \in \Theta$ and observation sequence $y_{1:T} \in \mathbb{R}^{p \times T}$, we run the locally optimal particle filter for the SSR model as specified in Algorithm 1 below.

#### **Algorithm 1:** Locally Optimal Particle Filter.

Given a parameter $\theta \in \Theta$, initialize by setting $x_0^{(i)} := x_0$ and $\bar{w}_0^{(i)} := 1/N$ for $i = 1, \ldots, N$. For $t = 1, 2, \ldots, T$:

1. Sample particles $\{\tilde{x}_t^{(i)}\}_{i=1}^{N}$ with distribution

$$\tilde{\mathbf{x}}_t^{(i)} \sim p_\theta(\mathbf{x}_t \mid \mathbf{x}_{t-1}^{(i)}, y_t), \tag{55}$$

and set $\tilde{x}_{1:t}^{(i)} := [\, x_{1:t-1}^{(i)} \;\; \tilde{x}_t^{(i)} \,]$ for $i = 1, 2, \ldots, N$.

2. Calculate the unnormalized importance weights, $\{w_t^{(i)}\}_{i=1}^{N}$,

$$w_t^{(i)} = p_\theta(y_t \mid \tilde{\mathbf{x}}_{t-1,t}^{(i)}) \, \bar{w}_{t-1}^{(i)} \tag{56}$$

for $i = 1, \ldots, N$. Then compute the normalized importance weights

$$\bar{w}_t^{(i)} = \frac{w_t^{(i)}}{W_t^N}, \quad W_t^N := \sum_{i=1}^N w_t^{(i)}, \tag{57}$$

for $i = 1, \ldots, N$.

3. Sample $N$ particle indices $\{I^{(i)}\}_{i=1}^{N}$, $I^{(i)} \in \{1, \ldots, N\}$, with probabilities

$$\Pr(I^{(i)} = j \mid \tilde{\mathcal{F}}_t, y_{1:t}) = \bar{w}_t^{(j)}, \quad j \in \{1, \ldots, N\} \tag{58}$$

for $i = 1, \ldots, N$. Set the resampled particles $x_{1:t}^{(i)} := \tilde{x}_{1:t}^{(I^{(i)})}$, and the normalized importance weights $\bar{w}_t^{(i)} := 1/N$ for $i = 1, \ldots, N$.
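The three steps above can be sketched for a hypothetical scalar linear Gaussian model, $x_t = a + g x_{t-1} + \eta_t$ and $y_t = x_t + u_t$, chosen because the exact log-likelihood is available from the Kalman filter and can serve as a check. This is an illustration under these stated assumptions and not an implementation for the SSR model; all function and variable names are our own. The sketch also accumulates $\sum_t \log W_t^N$ from the normalizing constants in step 2.

```python
import numpy as np

def particle_filter(y, x0, a, g, s2x, s2y, N, rng):
    """Locally optimal particle filter for the toy model
    x_t = a + g*x_{t-1} + eta_t,  y_t = x_t + u_t,
    with eta_t ~ N(0, s2x) and u_t ~ N(0, s2y)."""
    T = len(y)
    paths = np.empty((N, T))              # particle paths, overwritten on resampling
    x_prev = np.full(N, float(x0))
    w_bar = np.full(N, 1.0 / N)
    s2_y_pred = s2x + s2y                 # variance of y_t given x_{t-1}
    gain = s2x / s2_y_pred
    s2_post = s2x - gain * s2x            # variance of x_t given x_{t-1}, y_t
    loglik = 0.0
    for t in range(T):
        mu_pred = a + g * x_prev          # mean of x_t (and of y_t) given x_{t-1}
        # Step 1 (propagation): draw from p(x_t | x_{t-1}, y_t).
        mu_post = mu_pred + gain * (y[t] - mu_pred)
        paths[:, t] = mu_post + np.sqrt(s2_post) * rng.standard_normal(N)
        # Step 2 (weighting): w = p(y_t | x_{t-1}) times the previous weight.
        dens = np.exp(-0.5 * (y[t] - mu_pred) ** 2 / s2_y_pred)
        dens /= np.sqrt(2.0 * np.pi * s2_y_pred)
        w = dens * w_bar
        W = w.sum()
        loglik += np.log(W)               # accumulates sum_t log W_t^N
        w_bar = w / W
        # Keep the weighted (pre-resampling) sample of the current period.
        weighted_paths, weights = paths[:, : t + 1].copy(), w_bar.copy()
        # Step 3 (multinomial resampling), then reset to uniform weights.
        idx = rng.choice(N, size=N, p=w_bar)
        paths[:, : t + 1] = paths[idx, : t + 1]
        x_prev = paths[:, t].copy()
        w_bar = np.full(N, 1.0 / N)
    return loglik, weighted_paths, weights

def kalman_loglik(y, x0, a, g, s2x, s2y):
    """Exact log-likelihood of the same toy model with x_0 known."""
    m, P, ll = float(x0), 0.0, 0.0
    for y_t in y:
        m_pred, P_pred = a + g * m, g * g * P + s2x
        S = P_pred + s2y
        ll += -0.5 * (np.log(2.0 * np.pi * S) + (y_t - m_pred) ** 2 / S)
        k = P_pred / S
        m, P = m_pred + k * (y_t - m_pred), (1.0 - k) * P_pred
    return ll

# Simulate a short series and compare the two log-likelihood values.
rng = np.random.default_rng(2)
a, g, s2x, s2y, x0, T = 0.0, 0.8, 1.0, 0.5, 0.0, 50
x, y = x0, np.empty(T)
for t in range(T):
    x = a + g * x + np.sqrt(s2x) * rng.standard_normal()
    y[t] = x + np.sqrt(s2y) * rng.standard_normal()

ll_pf, paths, w = particle_filter(y, x0, a, g, s2x, s2y, N=2000, rng=rng)
ll_kf = kalman_loglik(y, x0, a, g, s2x, s2y)
```

Copying the full path array at every period is wasteful and is done here only to keep the sketch short; an implementation for long samples would store ancestor indices instead.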

**Remark 6.** *The resampling method applied in step (3.) of Algorithm 1 is known as* multinomial resampling*. Alternative methods that are guaranteed to produce lower Monte Carlo variance exist, cf. Douc et al. (2005). We consider multinomial resampling for its analytical tractability, and recommend applying one of the more efficient alternatives in practice.*
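As one example of such an alternative, stratified resampling draws a single uniform variate within each of $N$ equally sized strata of the unit interval and inverts the cumulative weights. The sketch below is a generic illustration with hypothetical names and is not taken from the paper.

```python
import numpy as np

def stratified_resample(w_bar, rng):
    """Return N particle indices (0-based) drawn by stratified resampling
    from the normalized weights w_bar."""
    N = len(w_bar)
    u = (np.arange(N) + rng.uniform(size=N)) / N   # one uniform per stratum
    cdf = np.cumsum(w_bar)
    cdf[-1] = 1.0                                  # guard against round-off
    return np.searchsorted(cdf, u)

rng = np.random.default_rng(3)
w_bar = np.array([0.1, 0.2, 0.3, 0.4])
idx = stratified_resample(w_bar, rng)
```

Because the uniforms are increasing across strata, the returned indices are sorted, and the function can replace the multinomial draw in the resampling step without other changes.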

**Remark 7.** *The notation $x_{1:t}^{(i)}$ is ambiguous due to the resampling step of Algorithm 1, since the elements of the ith particle path at time $t - 1$, denoted $x_{1:t-1}^{(i)}$, are not necessarily the same as the first $t - 1$ elements of the ith particle path at time t, denoted $x_{1:t}^{(i)}$. By convention, $x_{1:t}^{(i)}$ always refers to the particle chain after resampling at time t (similarly $\tilde{x}_{1:t}^{(i)}$ refers to the chain before resampling). We refer to elements k to l of the ith particle chain after resampling at time t as $x_{k:l,t}^{(i)}$.*

The particle filter in Algorithm 1 produces two particle samples at each point in time, $t$. The first set, $\{\tilde{x}_{1:t}^{(i)}\}_{i=1}^{N}$, is produced at the propagation step (1.) and is associated with importance weights in the weighting step (2.), $\{\bar{w}_t^{(i)}\}_{i=1}^{N}$. The second set, $\{x_{1:t}^{(i)}\}_{i=1}^{N}$, is produced at the resampling step (3.). Both sets are approximately drawn from the smoothing density (28). We note the resampling step introduces additional sampling error, cf. Chopin (2004), so we calculate approximations using the weighted sample unless otherwise specified.

The particle filter iterates over the propagation, weighting and resampling steps throughout the sequence, $t = 1, \ldots, T$, after which the algorithm terminates. We note the two sets of particles produced during each iteration are themselves random variables measurable with respect to the sub-*σ*-algebras $\tilde{\mathcal{F}}_t$ and $\mathcal{F}_t$, defined next.

**Definition 1.** *Define the sub-σ-algebras $\tilde{\mathcal{F}}_t := \mathcal{F}_{t-1} \cup \sigma(\tilde{x}_t^{(1)}, \ldots, \tilde{x}_t^{(N)})$, $\mathcal{F}_t := \tilde{\mathcal{F}}_t \cup \sigma(x_t^{(1)}, \ldots, x_t^{(N)})$ for $t = 1, \ldots, T$, initialized by $\mathcal{F}_0 := \emptyset$.*

At each point in time, we associate an empirical measure with the weighted particle sample generated by the propagation (1.) and weighting (2.) steps in Algorithm 1. Formally, for $t = 1, 2, \ldots, T$, we define the empirical measure,

$$\bar{p}_\theta^N(\mathrm{d}\mathbf{x}_{1:t} \mid y_{1:t}) := \sum_{i=1}^{N} \bar{w}_t^{(i)} \, \delta_{\tilde{\mathbf{x}}_{1:t}^{(i)}}(\mathrm{d}\mathbf{x}_{1:t}), \tag{59}$$

where $\delta_x(\mathrm{d}x)$ denotes the point measure at $x \in \mathbb{R}^p$ with respect to $\mathrm{d}x$. The weighted particles that constitute the empirical measure (59) are approximately distributed according to the smoothing density (28). We emphasize the weighted particles are not independent draws from (28), because the resampling step introduces dependence between the particles at each iteration of the algorithm. We use the empirical measure (59) to define a particle filter-based approximation of the intractable smoothing problem in (27),

$$\mathbb{E}_\theta^N[\gamma_t(\mathbf{x}_{1:t}) \mid y_{1:t}] := \int \gamma_t(\mathbf{x}_{1:t}) \, \bar{p}_\theta^N(\mathrm{d}\mathbf{x}_{1:t} \mid y_{1:t}) = \sum_{i=1}^{N} \bar{w}_t^{(i)} \gamma_t(\tilde{\mathbf{x}}_{1:t}^{(i)}), \tag{60}$$

for any point in time *t* ∈ {1, ... , *T*}. Due to dependence between the weighted particles, we cannot establish the asymptotic properties of the approximation (60) based on the law of large numbers and central limit theorem for independent random variables. Nevertheless, for appropriately chosen test functions *γt*(*x*1:*t*), the approximation (60) is both consistent and asymptotically Gaussian as the number of particles tends to infinity, *N* → ∞, cf. Theorem 9.4.5 in Cappé et al. (2005).

The particle filter in Algorithm 1 also produces an approximation of the log-likelihood function (8) evaluated at the parameter value *θ* and the observation sequence *y*1:*T*,

$$\bar{\ell}_T^N(\theta) := \sum_{t=1}^T \log W_t^N. \tag{61}$$

We note that the approximate log-likelihood function (61) consists of the logarithm of the product of normalizing constants produced by Algorithm 1. The approximate log-likelihood (61) is consistent in the sense that it converges in probability to the true log-likelihood function, as the number of particles tends to infinity, see Lemma 4.

**Lemma 4.** *For the model (1)–(3) and θ* ∈ Θ*, the approximate log-likelihood function (61) produced by Algorithm 1 is a consistent estimator of the true log-likelihood (8),*

$$\bar{\ell}_T^N(\theta) \;\stackrel{P}{\rightarrow}\; \ell_T(\theta), \tag{62}$$

*as N* → ∞*.*

In addition to producing an approximation of the intractable log-likelihood function (8), we apply the approximation (60) of the intractable smoothing problem in (27) to produce approximations of the sample score and observed Information matrix via Fisher's and Louis' identities in Lemma 2.

#### *6.2. The Approximate Sample Score and Observed Information Matrix*

We showed in Section 5 that the sample score and observed Information matrix can be expressed in terms of smoothing problems of the form (27). Appealing to Fisher's identity (37) in Lemma 2, and to the approximation of the smoothing problem (60), we define the particle filter-based approximate sample score as,

$$\tilde{S}_T^N(\theta) := \sum_{i=1}^N U_T(\tilde{\mathbf{x}}_{1:T}^{(i)}; \theta) \, \bar{w}_T^{(i)}, \tag{63}$$

for any parameter *θ* ∈ Θ, with the function *UT*(*x*1:*T*; *θ*) as defined in (33). If Conjecture 2 holds, then the approximate sample score in (63) is both consistent and asymptotically normal.

**Lemma 5.** *If Conjecture 2 holds and θ* ∈ Θ*, then the approximate sample score (63) is asymptotically normal,*

$$\sqrt{N}\left\{\tilde{S}_T^N(\theta) - S_T(\theta)\right\} \;\stackrel{D}{\rightarrow}\; N(0, \tilde{\mathbb{S}}_T[U_T(\mathbf{x}_{1:T}; \theta)]), \tag{64}$$

*as $N \to \infty$. An intractable expression for the asymptotic covariance matrix $\tilde{\mathbb{S}}_T[U_T(x_{1:T}; \theta)]$ is given in Lemma A.5 by setting $t = T$ and $\gamma_T(x_{1:T}) = U_T(x_{1:T}; \theta)$.*

Similarly, by appealing to Louis' identity (38) in Lemma 2, and to the approximation of the smoothing problem (60), we define the particle filter-based approximate observed Information matrix as,

$$I_T^N(\theta) := \tilde{S}_T^N(\theta) \tilde{S}_T^N(\theta)' - \tilde{G}_T^N(\theta) - \tilde{K}_T^N(\theta), \tag{65}$$

for any parameter *θ* ∈ Θ, where we define the approximations to (39) and (40) as

$$\tilde{G}_T^N(\theta) := \sum_{i=1}^N V_T(\tilde{\mathbf{x}}_{1:T}^{(i)}; \theta) \, \bar{w}_T^{(i)} \tag{66}$$

$$\tilde{K}_T^N(\theta) := \sum_{i=1}^N U_T(\tilde{\mathbf{x}}_{1:T}^{(i)}; \theta) U_T(\tilde{\mathbf{x}}_{1:T}^{(i)}; \theta)' \, \bar{w}_T^{(i)}, \tag{67}$$

and the functions *UT*(*x*1:*T*; *θ*) and *VT*(*x*1:*T*; *θ*) are defined in (33) and (34), respectively. If Conjecture 2 holds, then the approximate observed Information in (65) is consistent, as stated in the following lemma.

**Lemma 6.** *If Conjecture 2 holds and θ* ∈ Θ*, then the approximate observed Information matrix (65) is consistent,*

$$I\_T^N(\theta) \quad \stackrel{P}{\rightarrow} \quad I\_T(\theta) \tag{68}$$

*as N* → ∞*.*
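Given the weighted particle paths at time $T$, the approximations (63) and (65)–(67) reduce to weighted sums, which can be sketched as follows. The function name and array layout are hypothetical. The check uses a toy model that is not the SSR model: $x \sim N(\theta, 1)$ and $y \mid x \sim N(x, 1)$, for which the smoothing density is $N((y+\theta)/2, 1/2)$, the exact score is $(y-\theta)/2$, and the exact observed Information is $1/2$; exact draws with uniform weights stand in for particle filter output.

```python
import numpy as np

def approx_score_and_information(U, V, w_bar):
    """Weighted-sample versions of (63) and (65)-(67).

    U: (N, d) array of complete data scores, one row per particle path.
    V: (N, d, d) array of complete data Hessians.
    w_bar: (N,) normalized importance weights.
    """
    S = np.einsum("i,ij->j", w_bar, U)            # approximate score (63)
    G = np.einsum("i,ijk->jk", w_bar, V)          # (66)
    K = np.einsum("i,ij,ik->jk", w_bar, U, U)     # (67)
    I = np.outer(S, S) - G - K                    # Louis' identity (65)
    return S, I

# Toy check with exact draws from the smoothing density (illustrative only).
theta, y = 0.4, 1.3
rng = np.random.default_rng(5)
N = 200_000
x = rng.normal((y + theta) / 2.0, np.sqrt(0.5), size=N)
U = (x - theta)[:, None]          # d/dtheta of the complete data log-likelihood
V = -np.ones((N, 1, 1))           # second derivative
w_bar = np.full(N, 1.0 / N)
S_hat, I_hat = approx_score_and_information(U, V, w_bar)
```

Written this way, the Information estimate equals minus the weighted mean of `V` minus the weighted covariance of `U`, so it is symmetric by construction.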

Both the approximate sample score (63) and observed Information matrix (65) are biased for finite *N*. This is a general issue related to the particle filter-based approximation of the smoothing problem (60). At each iteration, the particle filter in Algorithm 1 relies on an approximation of the normalizing constant, i.e., the likelihood contribution. This induces a finite-sample bias in (60) that gradually disappears as the number of particles *N* tends to infinity and is negligible for large enough *N*, cf. e.g., Robert and Casella (2010, sct. 3.3.2).

The particle filter-based approximation of the sample score (63) and observed Information matrix (65) correspond to a batch version of Algorithm A in Poyiadjis et al. (2011), which is of computational cost $O(N)$, but exhibits quadratically increasing variance of the approximate sample score as a function of the sample size $T$. We note that Poyiadjis et al. (2011) also suggest an alternative algorithm that exhibits linearly increasing variance as a function of $T$, but at the computational cost $O(N^2)$. For smaller sample sizes, such as monthly observations as usually encountered in economics, we have found that the $O(N)$ algorithm is adequate.

#### **7. Particle Filter-Based Estimation and Inference**

In this section, we show how the approximate sample score (63) and observed Information matrix (65) can be used to perform parameter estimation and inference. We apply a stochastic approximation method based on the approximate sample score to approximate the ML estimator (19). This has recently been suggested in Poyiadjis et al. (2011). We then use the approximate observed Information matrix to obtain approximate standard errors for the approximate ML estimates. Although these quantities are 'approximate', we note that they can be made arbitrarily precise by increasing the number of particles, *N*, at the expense of increased computational effort.

Recall from Section 4 that the ML estimator (19) is doubly intractable. Consequently, we cannot apply gradient-based optimization algorithms to maximize the log-likelihood function (8). Originally proposed in Robbins and Monro (1951), stochastic approximation methods are conceptually similar to gradient-based optimization methods, but rely on noisy rather than exact evaluations of the sample score to optimize the objective function. The basic idea is that appropriately decreasing the step sizes provides an averaging of the random errors induced by the noisy evaluations of the sample score. For a book-length treatment of stochastic approximation, we refer to Kushner and Yin (2003).

The stochastic approximation algorithm proposed in Poyiadjis et al. (2011, sct. 3.1) consists of a recursion that is conceptually similar to the steepest descent method, cf. e.g., Nocedal and Wright (2006, chp. 3). Prior to executing the algorithm, we choose a fixed initial parameter value $\theta_0 \in \Theta$, a sequence of particle counts $\{N_j\}_{j=1}^{\infty}$, a sequence of step sizes $\{\gamma_j\}_{j=1}^{\infty}$, and a sequence of weight matrices $\{B_j\}_{j=1}^{\infty}$. The particle counts must be monotonically increasing positive integers, the step sizes must be strictly positive, non-summable but square summable,

$$\sum\_{j=1}^{\infty} \gamma\_j = \infty \quad \text{and} \quad \sum\_{j=1}^{\infty} \gamma\_j^2 < \infty,\tag{69}$$

and the weight matrices must be positive definite. Having chosen the initial parameter, particle counts, step sizes, and weight matrices, we run the recursion,

$$\theta_{j+1} = \theta_j + \gamma_j B_j \tilde{S}_T^{N_j}(\theta_j), \tag{70}$$

for $j = 0, 1, \ldots, K$. Here $K$ has to be sufficiently large in the sense that the sequence of parameter values generated by the recursion (70) has stabilized in a neighborhood of the true ML estimate. Additionally, if the particle count $N_j$ is large enough, the approximation error affecting the stochastic approximation recursion (70) will be approximately normal, cf. Lemma 5. In this case large disturbances will be rare, such that the parameter sequence $\{\theta_j\}_{j=1}^{K}$ is likely to stabilize without exhibiting large jumps.

We denote by $\{\tilde{x}_{1:T,j}^{(i)}, \bar{w}_{t,j}^{(i)}\}_{i=1}^{N_j}$ the particle paths produced by the particle filter in Algorithm 1 at iteration $j$ of the stochastic approximation recursion (70). The iteration index $j$ is notationally identical to the time index of the particle path, cf. Remark 7. Although this is an abuse of notation, it is clear from the context whether we refer to the parameter iteration or particle path time index. The parameter $\theta_{j+1}$ produced by iteration $j$ of (70) is a random variable that is measurable with respect to the sub-*σ*-algebra $\mathcal{G}_j$, defined next.

**Definition 2.** *Let $\mathcal{F}_{T,j} := \sigma(x_{1:T,j}^{(1)}, \ldots, x_{1:T,j}^{(N_j)})$ denote the sub-σ-algebra in Definition 1 generated with the parameter value $\theta_j$, and define the sub-σ-algebras $\mathcal{G}_j := \mathcal{G}_{j-1} \cup \mathcal{F}_{T,j}$ for $j = 1, \ldots$, initialized by $\mathcal{G}_0 := \mathcal{F}_{T,0}$.*

One of the main benefits of the stochastic approximation method is that the method is known to stabilize for a wide variety of initial values, particle counts, step sizes, and weight matrices. In practice, all of these choices affect the number of iterations needed to bring the parameter sequence into the neighborhood of the true ML estimator. The choice of step sizes is particularly important, since large step sizes generally speed up the convergence, but fail to dampen the approximation-induced noise. Small step sizes reduce the noise, but cause slow convergence. The particle count has a similar effect, since a low number of particles will result in a computationally cheap but noisy approximation of the sample score, while a large number of particles reduces the noise but increases the computational cost. Heuristically, it is appropriate to use a combination of large step sizes and small particle counts until the parameter sequence has reached a neighborhood of the ML estimator, and then switch to a combination of smaller step sizes and larger particle counts to reduce the noise. The intuition is that, while far away from the ML estimator, a relatively noisy approximation of the sample score will still on average lead the algorithm in the right direction.

The presence of noise in the sample score is not an impediment when applying stochastic approximation, since the use of decreasing step sizes provides an averaging of the errors. However, the finite sample bias of the particle filter-based approximate sample score, cf. Section 6.2, poses a problem since its effect is not mitigated by decreasing the step sizes. Bias reduction is possible by increasing the particle count *Nj* together with the iteration number *j*.

The stochastic approximation method is presented in Algorithm 2 below.<sup>1</sup>

#### **Algorithm 2:** Stochastic Approximation.

Choose the initial parameter *θ*<sub>0</sub> ∈ Θ, the particle counts {*N<sub>j</sub>*}<sup>∞</sup><sub>*j*=1</sub>, the step sizes {*γ<sub>j</sub>*}<sup>∞</sup><sub>*j*=1</sub> and the weighting matrices {*B<sub>j</sub>*}<sup>∞</sup><sub>*j*=1</sub>. For *j* = 0, 1, . . . , *K*:

1. Run the particle filter in Algorithm 1 with parameter *θ<sub>j</sub>* and *N<sub>j</sub>* particles to obtain the particle paths and weights.

2. Compute the approximate sample score,

$$\tilde{S}\_{T}^{N\_j}(\theta\_j) = \sum\_{i=1}^{N\_j} \mathcal{U}\_T(\tilde{x}\_{1:T,j}^{(i)}; \theta\_j) \bar{w}\_{t,j}^{(i)}.\tag{71}$$

3. With step size *γ<sub>j</sub>*, ascend along the direction *B<sub>j</sub>*,

$$
\theta\_{j+1} = \theta\_{j} + \gamma\_{j} B\_{j} \tilde{S}\_{T}^{N\_{j}}(\theta\_{j}) \,. \tag{72}
$$
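The recursion (72) can be illustrated on a toy problem. The Python sketch below is not the implementation used in this paper: `noisy_score` is a hypothetical stand-in for the particle filter-based score (71), returning the exact score of an i.i.d. N(*θ*, 1) model plus noise that shrinks with the particle count; a scalar weight replaces *B<sub>j</sub>*; and the step sizes and particle counts are illustrative choices that satisfy (69).

```python
import random


def noisy_score(theta, data, n_particles, rng):
    """Hypothetical stand-in for the approximate sample score in (71).

    Returns the exact score of an i.i.d. N(theta, 1) model plus Gaussian
    noise whose standard deviation is proportional to 1/sqrt(n_particles),
    mimicking Monte Carlo error that shrinks as the particle count grows.
    """
    exact = sum(y - theta for y in data)
    noise = rng.gauss(0.0, len(data) / n_particles ** 0.5)
    return exact + noise


def stochastic_approximation(theta0, data, n_iter, seed=0):
    """Run theta_{j+1} = theta_j + gamma_j * B_j * score_j, cf. (72)."""
    rng = random.Random(seed)
    T = len(data)
    theta = theta0
    path = [theta]
    for j in range(n_iter):
        n_particles = 50 + j // 20        # increasing particle count (illustrative)
        gamma = (j + 10) ** (-2.0 / 3.0)  # non-summable, square-summable steps
        weight = 1.0 / T                  # scalar stand-in for B_j, strictly positive
        theta = theta + gamma * weight * noisy_score(theta, data, n_particles, rng)
        path.append(theta)
    return path


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
path = stochastic_approximation(theta0=0.0, data=data, n_iter=2000)
sample_mean = sum(data) / len(data)  # exact ML estimate in this toy model
```

In this toy setting the final iterate `path[-1]` should settle close to `sample_mean`, with the remaining fluctuation driven by the noise in the score.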

<sup>1</sup> We use the Cholesky factorization to ensure positive definiteness of the covariance matrices Ω<sub>*u*</sub>, Λ and Ω<sub>Φ</sub>. Thus, we estimate the parameters *B*, *A*, Ω<sub>*u*</sub> = *C<sub>u</sub>C<sub>u</sub>*′, *μ*, Φ, Ω<sub>Φ</sub> = *C*<sub>Φ</sub>*C*<sub>Φ</sub>′ and Λ = *C*<sub>Λ</sub>*C*<sub>Λ</sub>′ using Algorithm 2 and transform the covariances to the original parametrization. We obtain standard errors via the *δ*-method.

Polyak (1990) and Polyak and Juditsky (1992) showed that if the step sizes {*γ<sub>j</sub>*}<sup>∞</sup><sub>*j*=1</sub> satisfy the summability conditions (69) and tend to zero slower than *j*<sup>−1</sup>, then the average of the last *j* − *K*<sub>0</sub> iterations converges at an optimal rate. Here *K*<sub>0</sub> < *K* denotes the iteration number at which the averaging begins; implicitly, we discard the initial *K*<sub>0</sub> iterations. We define the approximate ML estimator as,

$$\tilde{\theta}\_{T} := \frac{1}{K - K\_{0}} \sum\_{j=K\_{0}}^{K} \theta\_{j} \, , \tag{73}$$

suppressing the dependence on the particle count. Establishing convergence of the approximate ML estimator (73) to the true ML estimator (19) is outside the scope of this paper. However, if (73) converges in probability to (19) for any fixed *T*, then (73) inherits the consistency property, cf. Theorem 2, of the true ML estimator.
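As a small illustration of the averaging step, the Python sketch below computes the mean of the last *K* − *K*<sub>0</sub> iterates. The function name `polyak_average` and the toy iterates are hypothetical. The sketch follows the verbal description (discard the initial iterates, then average the rest); this differs from the summation limits written in (73) by a single term, which is negligible when *K* − *K*<sub>0</sub> is large.

```python
def polyak_average(thetas, k0):
    """Average the last K - K0 iterates of a stochastic approximation run.

    `thetas` holds [theta_0, ..., theta_K]; the iterates theta_0, ..., theta_{K0}
    are discarded and the remaining K - K0 values are averaged.
    """
    k = len(thetas) - 1
    if not 0 <= k0 < k:
        raise ValueError("K0 must satisfy 0 <= K0 < K")
    tail = thetas[k0 + 1:]
    return sum(tail) / len(tail)


# Toy iterates that oscillate around 1.0 after an initial transient.
thetas = [5.0, 3.0, 1.2, 0.8, 1.2, 0.8, 1.2]
theta_tilde = polyak_average(thetas, k0=2)
```

Averaging cancels the oscillation in the retained iterates, which is the sense in which decreasing steps combined with averaging dampen the approximation noise.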

Convergence of the particle filter-based stochastic approximation method proposed in Poyiadjis et al. (2011) has, to the author's knowledge, not been studied yet. The finite-sample bias of the approximate sample score (63) presents the primary obstacle to establishing convergence results. Intuition suggests that increasing the number of particles *Nj* with the iteration number *j* solves the problem. However, convergence of such schemes has not been carefully established, cf. Douc et al. (2014, sct. 12.1.2). Poyiadjis et al. (2011) report stabilization of the particle filter-based stochastic approximation method with constant particle count. In Section 10, we report similar stabilization with increasing particle counts.

If the model is correctly specified, we would conduct inference on the ML estimator via the observed Information matrix, cf. Section 4. Analogously, since the approximate observed Information matrix (65) converges in probability to the true observed Information matrix (10), we can conduct inference for the approximate ML estimator (73) via the approximate observed Information matrix (65), the same way we would conduct inference given the true observed Information matrix (10).

#### **8. Model Diagnostics**

In this section, we introduce a method to conduct model diagnostics, such that we may assess whether the SSR model is well-specified for a given parameter *θ* and observation sequence *y*<sub>1:*T*</sub>. Recall that the disturbances *u<sub>t</sub>*, *η<sub>t</sub>* and *ν<sub>t</sub>* are normally distributed and serially independent with mean zero and unit variances. Because the components *ε<sub>t</sub>* and *ξ<sub>t</sub>* are hidden to us, we cannot directly compute the residuals corresponding to the disturbances. Instead, we introduce the normalized one-step prediction errors, cf. Durbin and Koopman (2012, sct. 2.12), that can be approximated via particle filtering. This approach to model diagnostics for state space models has also previously been considered in Pitt and Shephard (1999a).

We define the normalized one-step prediction errors as,

$$e\_t := \mathbb{V}\mathrm{ar}\_\theta \left[ y\_t \mid y\_{1:t-1} \right]^{-1/2} \left( y\_t - \mathbb{E}\_\theta \left[ y\_t \mid y\_{1:t-1} \right] \right) \tag{74}$$

for *t* = 1, ... , *T*. For a well-specified model, the sequence of normalized one-step prediction errors should be serially independent with mean zero and unit variance. Any deviation from these characteristics is indicative of model misspecification.
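These properties can be checked informally before turning to formal tests. The Python sketch below computes the sample mean, variance and lag-one autocorrelation of a scalar series of (approximate) prediction errors; the function name `basic_diagnostics` is hypothetical, and the sketch is a descriptive check only, not a substitute for the misspecification tests cited later in this section.

```python
def basic_diagnostics(errors):
    """Return (mean, variance, lag-1 autocorrelation) of a scalar series."""
    n = len(errors)
    mean = sum(errors) / n
    centered = [e - mean for e in errors]
    # Population variance (divide by n), matching the autocorrelation below.
    var = sum(c * c for c in centered) / n
    # Lag-1 sample autocorrelation: lag-1 autocovariance divided by the variance.
    acf1 = sum(centered[t] * centered[t - 1] for t in range(1, n)) / (n * var)
    return mean, var, acf1


# A perfectly alternating series has mean 0, variance 1 and strong
# negative first-order autocorrelation.
mean, var, acf1 = basic_diagnostics([1.0, -1.0, 1.0, -1.0])
```

For a well-specified model one would expect a mean near zero, a variance near one and an autocorrelation near zero, up to sampling variation.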

The conditional mean and variance in (74) can be stated in terms of smoothing problems, where the test functions are the conditional mean and variance of the predictive observation density,

$$\mathbb{E}\_{\theta} \left[ y\_t \mid y\_{1:t-1} \right] = \mathbb{E}\_{\theta} \left[ \mathbb{E}\_{\theta} \left[ y\_t \mid x\_{t-1} \right] \mid y\_{1:t-1} \right] \; \tag{75}$$

$$\mathbb{V}\mathrm{ar}\_{\theta}\left[y\_{t}\mid y\_{1:t-1}\right] = \mathbb{E}\_{\theta}\left[\mathbb{V}\mathrm{ar}\_{\theta}\left[y\_{t}\mid x\_{t-1}\right] \mid y\_{1:t-1}\right] + \mathbb{V}\mathrm{ar}\_{\theta}\left[\mathbb{E}\_{\theta}\left[y\_{t}\mid x\_{t-1}\right] \mid y\_{1:t-1}\right] \,. \tag{76}$$

We note that the conditional mean and variance of the predictive observation density are given in Lemma 3. Using the locally optimal particle filter in Algorithm 1, we define approximations to (75) and (76) as

$$\tilde{\mathbb{E}}\_{\theta}^{N}\left[y\_{t}\mid y\_{1:t-1}\right] := \sum\_{i=1}^{N} \tilde{\mu}\_{t|t-1}^{y,(i)} \tilde{w}\_{t-1}^{(i)}\tag{77}$$

$$\widetilde{\mathbb{V}\mathrm{ar}}\_{\theta}^{N} \left[ y\_t \mid y\_{1:t-1} \right] := \sum\_{i=1}^{N} \tilde{\Sigma}\_{t|t-1}^{y,(i)} \tilde{w}\_{t-1}^{(i)} + \sum\_{i=1}^{N} (\tilde{\mu}\_{t|t-1}^{y,(i)}) (\tilde{\mu}\_{t|t-1}^{y,(i)})' \tilde{w}\_{t-1}^{(i)} - \tilde{\mathbb{E}}\_{\theta}^{N} \left[ y\_t \mid y\_{1:t-1} \right] \tilde{\mathbb{E}}\_{\theta}^{N} \left[ y\_t \mid y\_{1:t-1} \right]', \tag{78}$$

respectively, where we have defined the conditional moments given each individual particle as,

$$\tilde{\mu}\_{t|t-1}^{y,(i)} := \mathbb{E}\_{\theta}[y\_t \mid \tilde{x}\_{t-1}^{(i)}] \tag{79}$$

$$\tilde{\Sigma}\_{t|t-1}^{y,(i)} := \mathbb{V}\mathrm{ar}\_{\theta}[y\_t \mid \tilde{x}\_{t-1}^{(i)}] \,, \tag{80}$$

for *i* = 1, ... , *N*. Finally, we use the approximations (77) and (78) to define the approximate normalized one-step prediction errors as follows,

$$\tilde{e}\_{t}^{N} := \widetilde{\mathbb{V}\mathrm{ar}}\_{\theta}^{N} \left[ y\_{t} \mid y\_{1:t-1} \right]^{-1/2} \left( y\_{t} - \tilde{\mathbb{E}}\_{\theta}^{N} \left[ y\_{t} \mid y\_{1:t-1} \right] \right) \, , \tag{81}$$

for *t* = 1, ... , *T*. Thus, by applying the particle filter in Algorithm 1, we obtain the sequence of approximate normalized one-step prediction errors *ẽ*<sub>1:*T*</sub> via (77)–(81). For *N* sufficiently large, we can use the sequence *ẽ*<sub>1:*T*</sub> to test whether the true sequence of normalized one-step prediction errors *e*<sub>1:*T*</sub> is serially independent with mean zero and unit variance. For common tests for serial dependence and ARCH effects see e.g., Doornik and Hendry (2013, sct. 11.9.2–3).
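The computation in (77)–(81) for a single time point can be sketched as follows. This Python sketch is illustrative only: the function names are hypothetical, the per-particle moments and weights are taken as given inputs rather than produced by Algorithm 1, and the inverse square root is implemented through a Cholesky factor, which is one possible choice of matrix square root and an assumption of this sketch.

```python
import math


def mixture_moments(mus, sigmas, weights):
    """Weighted mixture mean and covariance, cf. (77)-(78).

    mus[i] is the length-p mean and sigmas[i] the p x p covariance of y_t
    given particle i; weights[i] are the normalized particle weights.
    """
    p = len(mus[0])
    mean = [sum(w * mu[a] for w, mu in zip(weights, mus)) for a in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for w, mu, sig in zip(weights, mus, sigmas):
        for a in range(p):
            for b in range(p):
                cov[a][b] += w * (sig[a][b] + mu[a] * mu[b])
    for a in range(p):
        for b in range(p):
            cov[a][b] -= mean[a] * mean[b]
    return mean, cov


def cholesky(matrix):
    """Lower-triangular L with L L' = matrix (matrix must be positive definite)."""
    p = len(matrix)
    low = [[0.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(a + 1):
            s = matrix[a][b] - sum(low[a][k] * low[b][k] for k in range(b))
            low[a][b] = math.sqrt(s) if a == b else s / low[b][b]
    return low


def normalized_error(y, mus, sigmas, weights):
    """Solve L e = y - mean by forward substitution, so e = L^{-1}(y - mean)."""
    mean, cov = mixture_moments(mus, sigmas, weights)
    low = cholesky(cov)
    resid = [yi - mi for yi, mi in zip(y, mean)]
    e = []
    for a in range(len(y)):
        e.append((resid[a] - sum(low[a][k] * e[k] for k in range(a))) / low[a][a])
    return e


# Two equally weighted particles with identical moments: the mixture
# collapses to a single Gaussian with mean 0 and covariance diag(4, 9).
mus = [[0.0, 0.0], [0.0, 0.0]]
sigmas = [[[4.0, 0.0], [0.0, 9.0]]] * 2
e = normalized_error([2.0, 3.0], mus, sigmas, [0.5, 0.5])
```

In this degenerate example each component of the observation lies exactly one predictive standard deviation above its mean, so both normalized errors equal one.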

#### **9. Simulation Study**

In this section, we conduct a simulation study of the asymptotic properties of the ML estimator, stated in Theorem 2. We limit our treatment to *B*, *A*, Φ and Ω<sub>Φ</sub>, leaving aside the remaining parameters Ω<sub>*u*</sub>, *μ*, Ω<sub>*η*</sub>, Ω<sub>*ν*</sub> and Ω<sub>*η*,*ν*</sub>. Recall, the loading matrix for the stationary components *A* is conjectured to be asymptotically normal, while the loading matrix of the nonstationary components *B* is kept fixed. Due to the results of Chang et al. (2009), we expect the asymptotic distribution of *B* to be mixed normal, and we tentatively investigate this. Moreover, we consider the case where Φ<sub>*t*</sub> is a stochastic unit root. A deterministic unit root is associated with the Dickey-Fuller distribution, cf. Dickey and Fuller (1979), while a stochastic unit root has been shown to be asymptotically normal, see e.g., Ling (2007) and Bohn Nielsen and Rahbek (2014).

Recall, Theorem 2 is based on the conjectured properties of the true, intractable log-likelihood function and its derivatives, cf. Conjecture 1. The aim is to substantiate this conjecture by obtaining the distribution of the approximate ML estimator based on simulated data sets. Usually, the number of realizations in a simulation study of this type is in excess of 1000 and the sample length in excess of 2500 observations. Due to the computational intensity of the particle filter-based stochastic approximation method in Algorithm 2, we limit ourselves to 250 realizations and 500 observations.

We let each of the simulated data sets be a bivariate (*p* = 2) series of length *T* = 500 observations with *r* = 1 stationary component and *p* − *r* = 1 nonstationary component. We use the parameter values

$$B = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad A = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad \Omega\_{u} = \begin{bmatrix} 2.5^2 & 0 \\ 0 & 2.5^2 \end{bmatrix}, \tag{82}$$

$$
\mu = 0, \quad \phi = 1, \quad \omega\_{\phi}^2 = 0.25^2, \quad \omega\_{\eta}^2 = 15^2, \quad \omega\_{\eta, \nu} = 0, \quad \text{and} \quad \omega\_{\nu}^2 = 2.5^2,\tag{83}
$$

to generate the simulated data sets. We note the parameter values (83) result in a top Lyapunov coefficient of *γ<sub>n</sub>* = −0.035, computed via (15) with *n* = 10<sup>6</sup>, such that the RCAR process {*ξ<sub>t</sub>*}<sub>*t*=0, 1, ...</sub> is strictly stationary.
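A Monte Carlo computation of this kind can be sketched in Python. The sketch rests on an assumption that is not taken from (15) itself: in the scalar case (*r* = 1) the top Lyapunov coefficient is taken to be E[log|*φ<sub>t</sub>*|], with the random coefficient *φ<sub>t</sub>* drawn as Gaussian with mean *φ* and variance *ω*<sup>2</sup><sub>*φ*</sub>. The function name `top_lyapunov_scalar` is hypothetical.

```python
import math
import random


def top_lyapunov_scalar(phi, omega_phi, n, seed=0):
    """Monte Carlo estimate of E[log|phi_t|] with phi_t ~ N(phi, omega_phi^2).

    Assumes a scalar random coefficient; this is a sketch of the idea,
    not a transcription of equation (15).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += math.log(abs(rng.gauss(phi, omega_phi)))
    return total / n


# Parameter values from (83): phi = 1 and omega_phi = 0.25.
gamma_hat = top_lyapunov_scalar(phi=1.0, omega_phi=0.25, n=200_000)
```

Under these assumptions the estimate is negative even though the mean coefficient equals one, which is the sense in which a stochastic unit root process can still be strictly stationary.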

Having simulated 250 series with the data generating process given by (1)–(3) and (83), we apply Algorithm 2 with *K* = 600 iterations to obtain the approximate ML estimate for the parameter in question, e.g., *φ*, keeping all other parameters fixed at the true values in (83). We initialize the algorithm at the true parameter value, and initiate Polyak averaging at iteration *K*<sub>0</sub> = 100.<sup>2</sup> Moreover, we let the particle count increase as

$$N\_{j} = 50 + \lfloor j/20 \rfloor \, , \tag{84}$$

where ⌊·⌋ denotes the largest integer not exceeding the argument. We let the step size sequence decrease as

$$
\gamma\_j = 100(j + 500)^{-2/3},
\tag{85}
$$

and set the weight matrix to

$$B\_{j} = T^{-1} \text{diag} \left( \begin{bmatrix} 10^{-5} & 1 & 1 & 1 & 1 & 10^{-2} & 1 & 1 & 1 & 10^{-3} \end{bmatrix} \right) \tag{86}$$

for *j* = 1, 2, ... , *K*. Note the particle count (84) tends to infinity as *j* → ∞, eliminating the finite-sample bias of (63)–(65), the step sizes satisfy (69), and the weight matrix is constant.<sup>3</sup>
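The two schedules are simple enough to evaluate directly. The Python sketch below reads (84) as 50 plus the integer part of *j*/20 and (85) as stated; the function names are hypothetical.

```python
def particle_count(j):
    """N_j = 50 + floor(j / 20), cf. (84)."""
    return 50 + j // 20


def step_size(j):
    """gamma_j = 100 * (j + 500)^(-2/3), cf. (85)."""
    return 100.0 * (j + 500) ** (-2.0 / 3.0)


# Values at the end of a run with K = 10,000 iterations.
n_final = particle_count(10_000)
gamma_final = step_size(10_000)
```

Evaluated at *j* = 10,000, these schedules give 550 particles and a step size of roughly 0.2085, in line with the values reported for the empirical illustration in Section 10. The exponent −2/3 lies strictly between −1 and −1/2, which is what makes the step sizes non-summable but square summable as required by (69).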

The results from the simulation experiment are presented in Figure 1. Despite the relatively low number of realizations and observations, Figure 1 is instructive of the asymptotic distributions of *A*<sub>1</sub>, *φ* and *ω*<sup>2</sup><sub>*φ*</sub>, cf. Panels (a), (c) and (d). These all appear to be normal. Recall, Theorem 2 does not state the asymptotic distribution of the ML estimator for *B*<sub>2</sub>, and from Panel (b) it does not appear to be normal. Rather, the realizations in Panel (b) are consistent with mixed normality, as we would expect from the closely-related CST model, cf. Chang et al. (2009). To investigate further, one could simulate the *t*-ratios of *B*<sub>2</sub>, which should be standard normal. This involves the approximation of the observed Information matrix for each realization, which further increases the computational cost. For this reason, and because we consider *B* fixed, we do not pursue this further here.

<sup>2</sup> Because we initialize at the true parameter value, the parameter sequences stabilize within the first 100 iterations. Using *K* = 600 iterations is sufficient to reduce the impact of the approximation error.

<sup>3</sup> The choice of weight matrix is based on hand-tuning the convergence speed of Algorithm 2 by running a small number of trial-and-error runs with *N* = 50 particles and constant step size *γ* = 1.

**Figure 1.** Simulation study with 250 realizations of the approximate MLE for *A*<sub>1</sub>, *B*<sub>2</sub>, *φ*, and *ω*<sup>2</sup><sub>*φ*</sub>.

In summary, the findings of the simulation study tentatively support the conjecture made in Section 4. Namely, the ML estimator for *A*, Φ and Ω<sub>Φ</sub> is asymptotically normal. The ML estimator for *B*<sub>2</sub> appears to be consistent with mixed normality. We have not investigated the remaining parameters.

#### **10. An Illustration**

In this section, we illustrate the use of the SSR model by applying it to the monthly 10-year government bond rates for Germany and Greece from January 1999 to February 2018.<sup>4</sup> We denote the German and Greek bond rates *y*<sup>GE</sup> and *y*<sup>GR</sup>, respectively, and measure these in basis points per year. The sample begins at the introduction of the euro area and ends at present day. During this period, the rates initially exhibit convergence towards a common 'euro area rate', until interrupted by the euro area crisis beginning in 2009 and culminating in 2011. The rates, the spread and the changes in the spread are illustrated in Figure 2 below. Because the spread is up to 75 times larger during the second half of the sample than during the first half, we split the display of the sample into the first and second half, respectively.

Panels (a) and (b) in Figure 2 show the bond rates, Panels (c) and (d) show the spread, and Panels (e) and (f) show the changes in the spread in the two periods. We note two features of the observations. First, Panel (a) suggests the rates can be characterized by a shared common stochastic trend, since these tend to move in tandem. Second, Panels (d) and (f) suggest the spread can be characterized by an RCAR process, since the changes in the spread, cf. Panel (f), are clearly positively associated with the level of the spread itself, cf. Panel (d).

<sup>4</sup> Obtained via a Bloomberg LP Terminal using the ticker codes 'GDBR10 Index' and 'GGGB10YR Index'.

**Figure 2.** German and Greek 10-year government bond rates, spread and changes in the spread. Monthly observations in basis points from January 1999 to February 2018.

We define the observation vector as *y<sub>t</sub>* := [*y*<sup>GE</sup><sub>*t*</sub> *y*<sup>GR</sup><sub>*t*</sub>]′. We condition on the observation for January 1999, which we denote *y*<sub>0</sub>, such that the effective sample spans *t* = 1, ... , 229. From visual inspection of Figure 2, our working assumption is that the spread *y*<sup>GR</sup><sub>*t*</sub> − *y*<sup>GE</sup><sub>*t*</sub> is strictly stationary, while the rates *y<sub>t</sub>* share a common stochastic trend. With a *p* = 2 dimensional system, we thus have *r* = 1 stationary component and *p* − *r* = 1 nonstationary component. Moreover, we fix *B* = [1 1]′, such that the orthogonal complement *b* = [−1 1]′ produces the spread. To ensure the model is just-identified, we normalize on the second element of *A*, such that *A*<sub>2</sub> = 1.

We apply the particle filter-based stochastic approximation method in Algorithm 2 to obtain the approximate ML estimate of the model parameter *θ*. For this illustration, we run the algorithm for *K* = 10,000 iterations. We let the particle count increase as in (84) and the step size sequence decrease as in (85), and we set the weighting matrix as in (86). We initiate Polyak averaging at iteration *K*<sub>0</sub> = 5000.

**Figure 3.** Parameter and log-likelihood sequences from stochastic approximation with *K* = 10,000 iterations. We also show a moving average of lag order 500 for the log-likelihood sequence. To avoid large differences in the scales of the displayed sequences, we have scaled the sequences for *A*<sub>1</sub>, *ω*<sup>2</sup><sub>*η*</sub>, *ω*<sub>*η*,*ν*</sub>, *ω<sub>ν</sub>*, and *ω<sub>φ</sub>* by 100, 1/300, 1/50, 1/50, and 2, respectively.

Figure 3 shows the results of running the particle filter-based stochastic approximation method. Panel (a) displays the iterations for the parameters in the observation Equation (22), Panel (b) displays the iterations for the parameters in the transition Equation (23), and Panel (c) displays the sequence of realized approximate log-likelihoods together with a moving average of lag order 500. The algorithm has been implemented in the Ox 7 programming language, cf. Doornik (2012), using analytical derivatives of the complete data log-likelihood (32) for the evaluation of the function (33). The elements of the parameter sequence shown in Panels (a) and (b) have stabilized after the initial 7500 iterations. At the 10,000th iteration, the particle count has increased to 550, the step size decreased to 0.2085, and the sequences have stabilized. By inspection of the sequence of the approximate log-likelihood in Panel (c), we see that the value has also stabilized after approximately 7500 iterations.

The estimation results are presented in Table 1, together with approximate classic standard errors.<sup>5</sup> Before considering inference, we assess the model fit. We compute the normalized one-step prediction errors *ẽ*<sup>*N*</sup><sub>1:*T*</sub> via (81) using *N* = 1000 particles. Table 2 presents univariate tests for autocorrelation (AR) of order one and two, autoregressive conditional heteroskedasticity (ARCH) of order one, and a multivariate test for AR of order one and two, cf. Doornik and Hendry (2013, sct. 11.9.2–3). We cannot reject the null hypothesis of no-AR of order one and two in the univariate as well as multivariate tests at a 5% critical level. Nor can we reject the null hypothesis of no-ARCH for the residuals at a 5% critical level. However, we note the test for the German rate is close to, but below, our chosen critical level. This could suggest unmodeled heteroskedasticity in the German bond rate. In conclusion, the overall specification of the model is acceptable. Moreover, computing the top Lyapunov coefficient via (15) with *n* = 10<sup>5</sup> produces a coefficient of *γ̂<sub>n</sub>* = −0.007, which indicates the stationary direction is strictly stationary for *θ̃<sub>T</sub>*.


**Table 1.** Approximate ML estimate, *θ̃<sub>T</sub>*.

Note: The approximate log-likelihood is *ℓ̃<sub>T</sub>* = −2094.1. The approximate ML estimate has been obtained by running Algorithm 2 for *K* = 10,000 iterations with the particle count increasing to *N* = 550 particles, as described in the main text. The standard errors are based on the inverse of the approximate observed Information matrix computed with *N* = 1000 particles.



Note: The approximate normalized one-step prediction errors *ẽ*<sub>1:*T*</sub> have been computed with *N* = 1000 particles for the approximate ML estimate *θ̃<sub>T</sub>*, cf. Table 1.

The model is reasonably well-specified, and we therefore proceed to use the approximate classic standard errors to conduct inference on the approximate ML estimates. First, we note the standard error of the estimate of *A*<sub>1</sub> is extremely small. Since the test for no-ARCH for the residuals associated with the German rate is rejected at the 5% critical level, this could affect the approximate classic standard errors.<sup>6</sup> Nevertheless, it is economically plausible that the stationary component also loads into the German rate, given that a large increase in the Greek rate would in this case coincide with a small drop in the German rate, which is consistent with risk-averse investors seeking safer assets in times of uncertainty, such as the euro area crisis. Second, we cannot reject the null hypothesis *H*<sub>0</sub> : *φ* = 1 at a 5% critical level with *p* = 0.577. Third, the estimate of *ω*<sup>2</sup><sub>*φ*</sub> is significantly different from zero at any commonly used critical level. However, the constant term *μ* is not significantly different from zero with *p* = 0.533. Fourth, the measurement errors are highly positively correlated with coefficient 0.961, and the innovations of the unobserved components are highly negatively correlated with coefficient −0.876. The results in Table 1 suggest the level of the stationary direction is a stochastic unit root process without a constant term. An approximate likelihood ratio test for the joint null hypothesis *H*<sub>0</sub> : *φ* = 1, *μ* = 0 fails to reject the null at a 5% critical level with *p* = 0.374.

<sup>5</sup> The difference between computing the classic standard errors with *N* = 1000 and *N* = 10,000 particles is negligible.

<sup>6</sup> Particle filter-based approximate robust standard errors have been suggested in Doucet and Shephard (2012), but we do not pursue this idea further in the present context.

Based on the estimates in Table 1, we use the orthogonal complements *b* and *a* to compute the changes of the nonstationary and stationary components, given by *b*′Δ*y<sub>t</sub>* and *a*′Δ*y<sub>t</sub>*, respectively. These are illustrated in Figure 4. First, we note the magnitude of the changes in Panels (a) and (b) of Figure 4 is slightly larger during the second half of the sample than during the first (standard deviations 18.01 and 20.16, respectively). Otherwise, the series in Panels (a) and (b) in Figure 4 are consistent with a homoskedastic random walk plus measurement error, cf. (17). The magnitude of the changes in Panels (c) and (d) of Figure 4 is positively associated with the level, just as observed in Figure 2. This is consistent with a random coefficient autoregressive process plus measurement error, cf. (18).

**Figure 4.** Changes in the nonstationary *b*′*y<sub>t</sub>* and stationary *a*′*y<sub>t</sub>* components.

Summarizing, the empirical illustration suggests that the SSR model successfully characterizes the 10-year government bond rates for Germany and Greece during the period from January 1999 to February 2018. During this sample, the spread exhibits bubble-like behavior, which is captured by the random coefficient autoregressive dynamics of the stationary component. Additionally, the levels exhibit a shared common stochastic trend, which is captured by the random walk dynamics of the nonstationary component.

#### **11. Conclusions**

In this paper, we have proposed and studied the stochastic stationary root model, which is a multivariate nonlinear state space model. We introduced particle filter-based approximations of the intractable log-likelihood function, sample score and observed Information matrix. In turn, we used these to approximate the ML estimator via stochastic approximation, and showed how to perform inference via the approximate observed Information matrix. We considered model diagnostics to assess the model fit. Additionally, we conducted a simulation study to investigate the asymptotic properties of the ML estimator. Finally, we presented an empirical application to the 10-year government bond rates in Germany and Greece in the period from January 1999 to February 2018 to illustrate the usefulness of the SSR model.

**Acknowledgments:** The author gratefully acknowledges comments by two anonymous referees that have led to substantial improvements of the paper. The author also thanks the editors, Rocco Mosconi and Paolo Paruolo, for constructive feedback, and the assistant editor, Lu Liao, for assisting in the publication process. Finally, the author would like to thank Anders Rahbek, Michael Pitt, Siem Jan Koopman, Heino Bohn Nielsen, Katarina Juselius, Søren Johansen, Simon Hetland, Gareth Roberts, Adam Johansen, Axel Finke, and Anthony Lee for helpful comments and discussions. Part of the work was undertaken while the author was a PhD student at the Department of Economics at the University of Copenhagen and part of the work was undertaken while the author was a CRiSM Research Fellow at the Department of Statistics at the University of Warwick. While at the University of Warwick, funding from the Engineering and Physical Sciences Research Council (EPSRC) is gratefully acknowledged (Grant EP/D002060/1). All errors and omissions are the sole responsibility of the author.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Appendix A. Auxiliary Results**

**Lemma A.1.** *For the SSR model (1)–(3) with θ* ∈ Θ*, it holds that*


*for any t* ∈ {1, . . . , *T*}*.*

**Proof of Lemma A.1.** By Corollary 1 we have that ∫ *p<sub>ω</sub>*(*y<sub>t</sub>* | *x<sub>t</sub>*)*p<sub>λ</sub>*(*x<sub>t</sub>* | *x*<sub>*t*−1</sub>) d*x<sub>t</sub>* = *p<sub>θ</sub>*(*y<sub>t</sub>* | *x*<sub>*t*−1</sub>) is Gaussian, and therefore strictly positive for all *x*<sub>*t*−1</sub> ∈ ℝ<sup>*p*</sup> and *θ* ∈ Θ, which yields part (*i*). Moreover, because the observation density (25) is Gaussian with constant and non-singular covariance matrix, we obtain part (*ii*).

**Lemma A.2.** *For the SSR model (1)–(3) with θ* ∈ Θ*, the model likelihood p<sub>θ</sub>*(*y*<sub>1:*T*</sub>) *is strictly positive and finite,*

$$0 < p\_{\theta}(y\_{1:T}) < \infty. \tag{A1}$$

**Proof of Lemma A.2.** Preliminarily, we observe the likelihood in (A1) can equivalently be written in terms of the complete data likelihood *p<sub>θ</sub>*(*y*<sub>1:*T*</sub>, *x*<sub>1:*T*</sub>),

$$p\_{\theta}(y\_{1:T}) = \int p\_{\theta}(y\_{1:T}, x\_{1:T}) \, \mathrm{d}x\_{1:T} \,,\tag{A2}$$

which, by the state space structure of the model, cf. (25)–(26), is equivalently

$$p\_{\boldsymbol{\theta}}(\boldsymbol{y}\_{1:T}) = \int \prod\_{t=1}^{T} p\_{\boldsymbol{\omega}}(\boldsymbol{y}\_{t} \mid \boldsymbol{x}\_{t}) p\_{\boldsymbol{\lambda}}(\boldsymbol{x}\_{t} \mid \boldsymbol{x}\_{t-1}) \, d\boldsymbol{x}\_{1:T} \,. \tag{A3}$$

By Lemma A.1.(*i*) and (A3), we have that the likelihood in (A1) is strictly positive, since

$$\int \prod\_{t=1}^{T} p\_{\omega}(y\_t \mid \mathbf{x}\_t) p\_{\lambda}(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}) \, \mathrm{d}x\_{1:T} > 0 \,. \tag{A4}$$

Moreover, by Lemma A.1.(*ii*), the likelihood in (A1) is also finite, since

$$\begin{split} &\int \prod\_{t=1}^{T} p\_{\omega}(y\_t \mid \mathbf{x}\_t) p\_{\lambda}(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}) \, \mathrm{d}\mathbf{x}\_{1:T} \\ &\leq \prod\_{t=1}^{T} \sup\_{\mathbf{x}\_t \in \mathbb{R}^{p}} p\_{\omega}(y\_t \mid \mathbf{x}\_t) \int \prod\_{t=1}^{T} p\_{\lambda}(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}) \, \mathrm{d}\mathbf{x}\_{1:T} \\ &= \prod\_{t=1}^{T} \sup\_{\mathbf{x}\_t \in \mathbb{R}^{p}} p\_{\omega}(y\_t \mid \mathbf{x}\_t) < \infty, \end{split} \tag{A5}$$

which completes the proof of Lemma A.2.

**Lemma A.3.** *For the model (1)–(3) with θ* ∈ Θ*, it holds that*


*for t* ∈ {1, . . . , *T*}

**Proof of Lemma A.3.** We preliminarily note that the locally optimal transition density (46) can be written as

$$p\_{\theta}(\mathbf{x}\_{t}\mid\mathbf{x}\_{t-1},y\_{t}) = \frac{p\_{\omega}(y\_{t}\mid\mathbf{x}\_{t})p\_{\lambda}(\mathbf{x}\_{t}\mid\mathbf{x}\_{t-1})}{p\_{\theta}(y\_{t}\mid\mathbf{x}\_{t-1})},\tag{A6}$$

where the predictive observation density is given by the integral,

$$p\_{\boldsymbol{\theta}}(\boldsymbol{y}\_{t}\mid\boldsymbol{x}\_{t-1}) = \int p\_{\boldsymbol{\omega}}(\boldsymbol{y}\_{t}\mid\boldsymbol{x}\_{t}) p\_{\boldsymbol{\lambda}}(\boldsymbol{x}\_{t}\mid\boldsymbol{x}\_{t-1}) \,\mathrm{d}\boldsymbol{x}\_{t} \,. \tag{A7}$$

By (A6) and the definition of absolute continuity, part (*i*) states that for every Borel-measurable set 𝒜 ∈ ℬ(ℝ<sup>*p*</sup>), it holds that

$$\int\_{\mathcal{A}} \frac{p\_{\omega}(y\_t \mid \mathbf{x}\_t) p\_{\lambda}(\mathbf{x}\_t \mid \mathbf{x}\_{t-1})}{p\_{\theta}(y\_t \mid \mathbf{x}\_{t-1})} \, \mathbf{d}\mathbf{x}\_t = 0 \quad \Longrightarrow \quad \int\_{\mathcal{A}} p\_{\omega}(y\_t \mid \mathbf{x}\_t) p\_{\lambda}(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}) \, \mathbf{d}\mathbf{x}\_t = 0. \tag{A8}$$

By (A7) and Lemma A.1.(*i*), we know the predictive observation density is strictly positive, *p<sub>θ</sub>*(*y<sub>t</sub>* | *x*<sub>*t*−1</sub>) > 0, for all *x*<sub>*t*−1</sub> ∈ ℝ<sup>*p*</sup> and *θ* ∈ Θ. Therefore (A8) is true for all *x*<sub>*t*−1</sub> ∈ ℝ<sup>*p*</sup> and *θ* ∈ Θ, and part (*i*) holds.

To show part (*ii*), we first use (A6) to write

$$\frac{p\_{\omega}(y\_t \mid \mathbf{x}\_t) p\_{\lambda}(\mathbf{x}\_t \mid \mathbf{x}\_{t-1})}{p\_{\theta}(\mathbf{x}\_t \mid \mathbf{x}\_{t-1}, y\_t)} = p\_{\theta}(y\_t \mid \mathbf{x}\_{t-1}) \,, \tag{A9}$$

where, by Corollary 1, we have that *p<sub>θ</sub>*(*y<sub>t</sub>* | *x*<sub>*t*−1</sub>) is Gaussian and therefore strictly positive for all (*x<sub>t</sub>*, *x*<sub>*t*−1</sub>) ∈ ℝ<sup>*p*</sup> × ℝ<sup>*p*</sup> and *θ* ∈ Θ, and part (*ii*) holds.

Part (*iii*) follows from *p<sub>θ</sub>*(*x<sub>t</sub>* | *x*<sub>*t*−1</sub>, *y<sub>t</sub>*) being Gaussian, cf. Lemma 3, and therefore strictly positive for all *x*<sub>*t*−1</sub> ∈ ℝ<sup>*p*</sup>. Thus, part (*iii*) holds.

**Lemma A.4.** *If θ* ∈ Θ *and γ<sub>t</sub>*(*x*<sub>1:*t*</sub>) ∈ *L*<sup>1</sup>[ℝ<sup>*tp*</sup>, *p<sub>θ</sub>*(*x*<sub>1:*t*</sub> | *y*<sub>1:*t*</sub>)]*, then it holds that the approximation (60) is consistent,*

$$\mathbb{E}\_{\theta}^{N}\left[\gamma\_{t}(\mathbf{x}\_{1:t}) \mid y\_{1:t}\right] \stackrel{P}{\rightarrow} \quad \mathbb{E}\_{\theta}\left[\gamma\_{t}(\mathbf{x}\_{1:t}) \mid y\_{1:t}\right] \tag{A10}$$

*for any t* ∈ {1, . . . , *T*}*, as N* → ∞*.*

**Proof of Lemma A.4.** We apply Theorem 9.4.5.(i) in Cappé et al. (2005) by verifying its conditions, i.e., Assumptions 9.4.1–3. We note the theorem is stated for scalar test functions, but generalizes to higher-dimensional test functions. Assumptions 9.4.1–2 hold by Lemma A.1, while Assumption 9.4.3 holds by Lemma A.3. Thus, the conditions for Theorem 9.4.5.(i) in Cappé et al. (2005) are satisfied, which completes the proof of Lemma A.4.

**Lemma A.5.** *If θ* ∈ Θ *and γ<sub>t</sub>*(*x*<sub>1:*t*</sub>) ∈ *L*<sup>2</sup>[ℝ<sup>*tp*</sup>, *p<sub>θ</sub>*(*x*<sub>1:*t*</sub> | *y*<sub>1:*t*</sub>)]*, then it holds that the approximation (60) is consistent and asymptotically normal,*

$$\sqrt{N} \left\{ \mathbb{E}\_{\theta}^{N} \left[ \gamma\_{t}(\mathbf{x}\_{1:t}) \mid y\_{1:t} \right] - \mathbb{E}\_{\theta} \left[ \gamma\_{t}(\mathbf{x}\_{1:t}) \mid y\_{1:t} \right] \right\} \quad \stackrel{D}{\rightarrow} \quad N (0, \mathbb{S}\_{t}[\gamma\_{t}(\mathbf{x}\_{1:t})]) \,,\tag{A11}$$

*for any* $t \in \{1, \ldots, T\}$*, as* $N \to \infty$*. Initialized by* $\tilde{\mathbb{S}}_0 := 0$*, the asymptotic covariance matrix* $\tilde{\mathbb{S}}_t[\gamma_t(x_{1:t})]$ *is given by*

$$\begin{split} \mathbb{S}\_{t}[\gamma\_{t}(\mathbf{x}\_{1:t})] &= \mathbb{S}\_{t-1} \left[ \mathbb{E}\_{\mathbf{q},t} \left[ \left( \gamma\_{t}(\mathbf{x}\_{1:t}) - \mathbb{E}\_{\theta} \left[ \gamma\_{t}(\mathbf{x}\_{1:t}) \mid y\_{1:t} \right] \right) \frac{\overline{w}\_{t}(\mathbf{x}\_{1:t-1})}{\mathsf{p}\_{\theta} \left( y\_{t} \mid y\_{1:t-1} \right)} \mid \mathbf{x}\_{1:t-1} \right] \right] \\ &+ \mathsf{Var}\_{\theta} \left[ \mathbb{E}\_{\mathbf{q},t} \left[ \left( \gamma\_{t}(\mathbf{x}\_{1:t}) - \mathbb{E}\_{\theta} \left[ \gamma\_{t}(\mathbf{x}\_{1:t}) \mid y\_{1:t} \right] \right) \frac{\overline{w}\_{t}(\mathbf{x}\_{1:t})}{\mathsf{p}\_{\theta} \left( y\_{t} \mid y\_{1:t-1} \right)} \mid \mathbf{x}\_{1:t-1} \right] \mid y\_{1:t-1} \right] \\ &+ \mathbb{E}\_{\theta} \left[ \mathrm{Var}\_{\mathbf{q},t} \left[ \left( \gamma\_{t}(\mathbf{x}\_{1:t}) - \mathbb{E}\_{\theta} \left[ \gamma\_{t}(\mathbf{x}\_{1:t}) \mid y\_{1:t} \right] \right) \frac{\overline{w}\_{t}(\mathbf{x}\_{1:t})}{\mathsf{p}\_{\theta} \left( y\_{t} \mid y\_{1:t-1} \right)} \mid \mathbf{x}\_{1:t-1} \right] \mid y\_{1:t-1} \right], \end{split} \tag{A12}$$

*where, for any appropriately integrable function γ*(*x*1:*t*)*, we define the operators*

$$\mathbb{E}_{q,t}\left[\gamma(x_{1:t}) \mid x_{1:t-1}\right] := \int \gamma(x_{1:t})\, q_\theta(x_t \mid x_{1:t-1}, y_{1:t-1})\, \mathrm{d}x_t \tag{A13}$$

$$\mathsf{Var}_{q,t}\left[\gamma(x_{1:t}) \mid x_{1:t-1}\right] := \mathbb{E}_{q,t}\left[\gamma(x_{1:t})\gamma(x_{1:t})' \mid x_{1:t-1}\right] - \mathbb{E}_{q,t}\left[\gamma(x_{1:t}) \mid x_{1:t-1}\right]\mathbb{E}_{q,t}\left[\gamma(x_{1:t}) \mid x_{1:t-1}\right]', \tag{A14}$$

*omitting dependence on θ.*

**Proof of Lemma A.5.** We apply Theorem 9.4.5.(ii) in Cappé et al. (2005) by verifying its conditions, i.e., Assumptions 9.4.1–3. As in the proof of Lemma A.4, the theorem is stated for scalar test functions, but it generalizes to higher-dimensional test functions. Assumptions 9.4.1–2 hold by Lemma A.1, while Assumption 9.4.3 holds by Lemma A.3. Thus, the conditions of Theorem 9.4.5.(ii) in Cappé et al. (2005) are satisfied, which completes the proof of Lemma A.5.
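The $\sqrt{N}$ rate in (A11) can be illustrated numerically in a static setting. The sketch below is not the SSR particle filter: it assumes a hypothetical scalar toy model, $x \sim N(0,1)$ and $y \mid x \sim N(x,1)$ with $y = 1$, for which $\mathbb{E}[x \mid y] = 1/2$, and it uses self-normalized importance sampling with the prior as proposal. The function name `snis_mean` and all numerical values are illustrative assumptions. If the estimator converges at rate $\sqrt{N}$, then $N$ times its mean squared error should be roughly constant in $N$.

```python
import numpy as np

def snis_mean(rng, n, reps, y=1.0):
    """Self-normalized importance sampling estimates of E[x | y], one per replication."""
    x = rng.standard_normal((reps, n))      # proposal draws, x ~ N(0, 1) (toy assumption)
    w = np.exp(-0.5 * (y - x) ** 2)         # unnormalized weights, y | x ~ N(x, 1)
    return (w * x).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(3)
truth = 0.5                                 # E[x | y = 1] in this toy model
scaled_var = {}
for n in (500, 4000):
    est = snis_mean(rng, n, reps=400)
    # N times the mean squared error; roughly constant under a sqrt(N) rate
    scaled_var[n] = n * np.mean((est - truth) ** 2)

print(scaled_var)
```

In this toy setting the two entries of `scaled_var` should be of similar magnitude, up to Monte Carlo noise from the finite number of replications.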

#### **Appendix B. Main Results**

**Proof of Lemma 1.** We compute the conditional mean and variance of $\xi_t$ in Equation (2). First, the mean is

$$\begin{split} \mathbb{E}_\lambda\left[\xi_t \mid \xi_{t-1}\right] &= \mathbb{E}_\lambda\left[\mu + \Phi_t \xi_{t-1} + \nu_t \mid \xi_{t-1}\right] \\ &= \mu + \Phi \xi_{t-1}, \end{split} \tag{A15}$$

and then the variance is

$$\begin{split} \mathsf{Var}_\lambda\left[\xi_t \mid \xi_{t-1}\right] &= \mathsf{Var}_\lambda\left[\mu + \Phi_t \xi_{t-1} + \nu_t \mid \xi_{t-1}\right] \\ &= \mathsf{Var}_\lambda\left[\mu + \left(\xi_{t-1}' \otimes I_p\right) \mathsf{vec}(\Phi_t) + \nu_t \mid \xi_{t-1}\right] \\ &= \left(\xi_{t-1}' \otimes I_p\right) \mathsf{Var}_\lambda\left[\mathsf{vec}(\Phi_t)\right] \left(\xi_{t-1}' \otimes I_p\right)' + \mathsf{Var}_\lambda\left[\nu_t\right] \\ &= \left(\xi_{t-1}' \otimes I_p\right) \Omega_\Phi \left(\xi_{t-1}' \otimes I_p\right)' + \Omega_\nu. \end{split} \tag{A16}$$

Since the conditional distribution of $\xi_t$ given $\xi_{t-1}$ is Gaussian, it is completely characterized by its first and second conditional moments. Thus, we obtain Equations (12)–(13), which completes the proof of Lemma 1.
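The two moments in (A15)–(A16) can be checked numerically. The following is a minimal sketch, assuming that $\mathsf{vec}(\Phi_t)$ is Gaussian with mean $\mathsf{vec}(\Phi)$ and covariance $\Omega_\Phi$, that $\nu_t$ is Gaussian with covariance $\Omega_\nu$, and that the two are uncorrelated, as the third line of (A16) implies. The function name `rca_conditional_moments` and all parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def rca_conditional_moments(xi_prev, mu, Phi, Omega_Phi, Omega_nu):
    """Conditional mean and variance of xi_t given xi_{t-1}, following (A15)-(A16)."""
    dim = xi_prev.shape[0]
    # vec(Phi_t xi_{t-1}) = (xi_{t-1}' kron I) vec(Phi_t), with column-major vec
    K = np.kron(xi_prev.reshape(1, -1), np.eye(dim))
    mean = mu + Phi @ xi_prev
    var = K @ Omega_Phi @ K.T + Omega_nu
    return mean, var

# Hypothetical parameter values, for illustration only
rng = np.random.default_rng(0)
dim = 2
mu = np.array([0.1, -0.2])
Phi = np.array([[0.5, 0.1], [0.0, 0.3]])
Omega_Phi = 0.04 * np.eye(dim * dim)
Omega_nu = np.array([[1.0, 0.2], [0.2, 0.5]])
xi_prev = np.array([1.5, -0.7])

mean, var = rca_conditional_moments(xi_prev, mu, Phi, Omega_Phi, Omega_nu)

# Monte Carlo check: simulate xi_t = mu + Phi_t xi_{t-1} + nu_t many times
n = 200_000
vec_Phi = rng.multivariate_normal(Phi.flatten(order="F"), Omega_Phi, size=n)
Phi_draws = vec_Phi.reshape(n, dim, dim).transpose(0, 2, 1)  # undo column-major vec
nu = rng.multivariate_normal(np.zeros(dim), Omega_nu, size=n)
xi = mu + Phi_draws @ xi_prev + nu
mc_mean = xi.mean(axis=0)
mc_var = np.cov(xi, rowvar=False)

print(mean, mc_mean)
print(var, mc_var)
```

The simulated moments should agree with the closed-form ones up to Monte Carlo error. Note that the conditional variance grows with the magnitude of `xi_prev`, which is the source of the heavy-tailed dynamics.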

**Proof of Lemma 2.** The result is an application of Fisher's and Louis' identities to the SSR model. We use Proposition 10.1.6 in Cappé et al. (2005) and verify its conditions.

First, we verify that Assumption 10.1.3 in Cappé et al. (2005) holds. We have that $\Theta$ is an open subset of $\mathbb{R}^{d_\theta}$, which satisfies Assumption 10.1.3.(i). Assumption 10.1.3.(ii) is satisfied via Lemma A.2. Assumption 10.1.3.(iii) is encompassed by condition (b) of Proposition 10.1.6 in Cappé et al. (2005), shown below. Thus, Assumption 10.1.3 in Cappé et al. (2005) holds.

Second, we verify conditions (a) and (b) of Proposition 10.1.6 in Cappé et al. (2005). Condition (a) holds by Conjecture 1. For condition (b), we begin with the third and last part, which states that

$$\frac{\partial}{\partial\theta} \int \log p_\theta(y_{1:T}, x_{1:T})\, p_\theta(x_{1:T} \mid y_{1:T})\, \mathrm{d}x_{1:T} = \int \frac{\partial}{\partial\theta} \log p_\theta(y_{1:T}, x_{1:T})\, p_\theta(x_{1:T} \mid y_{1:T})\, \mathrm{d}x_{1:T}. \tag{A17}$$

For *θ*, *ϑ* ∈ Θ, the complete data log-likelihood (32) is log-Gaussian and therefore continuous with respect to *θ*, and (A17) holds.

The second part of condition (b) states that for *θ* ∈ Θ,

$$\int \left\| U_T(x_{1:T}; \theta) \right\| p_\theta(x_{1:T} \mid y_{1:T})\, \mathrm{d}x_{1:T} < \infty \tag{A18}$$

$$\int \left\| V_T(x_{1:T}; \theta) \right\| p_\theta(x_{1:T} \mid y_{1:T})\, \mathrm{d}x_{1:T} < \infty, \tag{A19}$$

which hold by Conjecture 2.

The first part of condition (b) states that for *θ*, *ϑ* ∈ Θ, the entropy function in (31) is twice-differentiable with respect to *θ* for fixed *ϑ* and $y_{1:T}$. Using (A17) and that the complete data log-likelihood (32) is twice-differentiable with respect to *θ*, we have that (31) is also twice-differentiable with respect to *θ*. Thus, Proposition 10.1.6 in Cappé et al. (2005) applies for the SSR model, which completes the proof of Lemma 2.

**Proof of Lemma 3.** Define the conditional moments of the locally optimal transition density (46),

$$\mu_{t|t}^{x} := \mathbb{E}_\theta\left[x_t \mid x_{t-1}, y_t\right] \quad \text{and} \quad \Sigma_{t|t}^{x} := \mathsf{Var}_\theta\left[x_t \mid x_{t-1}, y_t\right]. \tag{A20}$$

Applying the Gaussian projection, we can write these as

$$\begin{split} \mu_{t|t}^{x} &= \mathbb{E}_\lambda\left[x_t \mid x_{t-1}\right] + \mathsf{Cov}_\theta\left[x_t, y_t \mid x_{t-1}\right] \mathsf{Var}_\theta\left[y_t \mid x_{t-1}\right]^{-1} \left(y_t - \mathbb{E}_\theta\left[y_t \mid x_{t-1}\right]\right) \\ &= \mu_{t|t-1}^{x} + \Sigma_{t|t-1}^{x} \Pi' \left[\Sigma_{t|t-1}^{y}\right]^{-1} \left(y_t - \mu_{t|t-1}^{y}\right) \end{split} \tag{A21}$$

$$\begin{split} \Sigma_{t|t}^{x} &= \mathsf{Var}_\lambda\left[x_t \mid x_{t-1}\right] - \mathsf{Cov}_\theta\left[x_t, y_t \mid x_{t-1}\right] \mathsf{Var}_\theta\left[y_t \mid x_{t-1}\right]^{-1} \mathsf{Cov}_\theta\left[y_t, x_t \mid x_{t-1}\right] \\ &= \Sigma_{t|t-1}^{x} - \Sigma_{t|t-1}^{x} \Pi' \left[\Sigma_{t|t-1}^{y}\right]^{-1} \Pi \Sigma_{t|t-1}^{x}, \end{split} \tag{A22}$$

where we have used that,

$$\mathsf{Cov}_\theta\left[x_t, y_t \mid x_{t-1}\right] = \mathsf{Cov}_\theta\left[x_t, C(y_0) + \Pi x_t \mid x_{t-1}\right] = \Sigma_{t|t-1}^{x} \Pi'. \tag{A23}$$

We define the conditional moments of the predictive observation density,

$$\boldsymbol{\mu}\_{t|t-1}^{y} := \mathbb{E}\_{\theta} \left[ y\_t \mid \mathbf{x}\_{t-1} \right] = \mathbb{E}\_{\theta} \left[ \mathbf{C}(y\_0) + \Pi \mathbf{x}\_t \mid \mathbf{x}\_{t-1} \right] = \mathbf{C}(y\_0) + \Pi \boldsymbol{\mu}\_{t|t-1}^{\mathbf{x}} \tag{A24}$$

$$\Sigma\_{t|t-1}^y := \mathsf{Var}\_{\theta} \left[ y\_t \mid \mathbf{x}\_{t-1} \right] = \mathsf{Var}\_{\theta} \left[ \mathbf{C}(y\_0) + \Pi \mathbf{x}\_t \mid \mathbf{x}\_{t-1} \right] = \Pi \Sigma\_{t|t-1}^x \Pi' + \Omega\_u \tag{A25}$$

where we have used (22). Similarly, we define the conditional moments of the transition density,

$$
\mu\_{t|t-1}^x := \mathbb{E}\_{\lambda} \left[ \mathbf{x}\_t \mid \mathbf{x}\_{t-1} \right] = \mathfrak{a} + \Pi \mathbf{x}\_{t-1} \tag{A26}
$$

$$\Sigma_{t|t-1}^{x} := \mathsf{Var}_\lambda\left[x_t \mid x_{t-1}\right] = \Lambda_t, \tag{A27}$$

where we have used (23), which concludes the proof of Lemma 3.
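The Gaussian projection in (A21)–(A25) can be sketched in a few lines. This is a minimal illustration, assuming the predictive moments $\mu^x_{t|t-1}$ and $\Sigma^x_{t|t-1}$ are supplied as inputs and that $C(y_0)$ is a known vector, here called `c0`. The function name `locally_optimal_moments` and all numerical values are hypothetical.

```python
import numpy as np

def locally_optimal_moments(y_t, mu_x_pred, Sigma_x_pred, Pi, Omega_u, c0):
    """Moments of x_t given (x_{t-1}, y_t) by Gaussian projection, cf. (A21)-(A25)."""
    mu_y_pred = c0 + Pi @ mu_x_pred                          # (A24)
    Sigma_y_pred = Pi @ Sigma_x_pred @ Pi.T + Omega_u        # (A25)
    # gain = Sigma_x Pi' [Sigma_y]^{-1}, via a linear solve instead of an explicit inverse
    gain = np.linalg.solve(Sigma_y_pred, Pi @ Sigma_x_pred).T
    mu_x_filt = mu_x_pred + gain @ (y_t - mu_y_pred)         # (A21)
    Sigma_x_filt = Sigma_x_pred - gain @ Pi @ Sigma_x_pred   # (A22)
    return mu_x_filt, Sigma_x_filt

# Hypothetical inputs, for illustration only
Pi = np.array([[1.0, 0.5], [0.0, 1.0]])
Sigma_x_pred = np.array([[1.0, 0.3], [0.3, 2.0]])
Omega_u = 0.1 * np.eye(2)
c0 = np.zeros(2)
mu_x_pred = np.array([0.2, -0.1])
y_t = np.array([1.0, -1.0])

mu_x_filt, Sigma_x_filt = locally_optimal_moments(
    y_t, mu_x_pred, Sigma_x_pred, Pi, Omega_u, c0
)
print(mu_x_filt)
print(Sigma_x_filt)
```

The linear solve avoids forming $[\Sigma^y_{t|t-1}]^{-1}$ explicitly, which is usually the numerically safer choice when the function is called once per particle.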

**Proof of Lemma 4.** Lemma A.4 establishes that the conditions of Theorem 9.4.5 in Cappé et al. (2005) hold. It is a corollary to that theorem that

$$L\_T^N(\theta) := \prod\_{t=1}^T \mathcal{W}\_t^N \quad \stackrel{P}{\rightarrow} \quad p\_\theta(y\_{1:T}) =: L\_T(\theta) \,. \tag{A28}$$

as *N* → ∞. By continuity of the logarithm, the continuous mapping theorem and the definitions (8) and (61), we therefore have that,

$$\bar{\ell}\_T^N(\theta) = \log L\_T^N(\theta) \quad \stackrel{P}{\rightarrow} \quad \log L\_T(\theta) = \ell\_T(\theta) \,. \tag{A29}$$

as *N* → ∞, which completes the proof of Lemma 4.
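The convergence in (A28)–(A29) can be illustrated on a model where the exact log-likelihood is available. The sketch below is not the SSR model: it assumes a hypothetical scalar linear Gaussian state space model, $x_t = a x_{t-1} + v_t$ and $y_t = x_t + u_t$, so that the Kalman filter gives the exact value. It also swaps in a bootstrap proposal (the transition density) in place of the locally optimal density (46). All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def kalman_loglik(y, a, var_v, var_u, m0, P0):
    """Exact log-likelihood of x_t = a x_{t-1} + v_t, y_t = x_t + u_t."""
    m, P, ll = m0, P0, 0.0
    for yt in y:
        m_pred, P_pred = a * m, a * a * P + var_v            # predict
        S = P_pred + var_u                                   # innovation variance
        ll += -0.5 * (np.log(2 * np.pi * S) + (yt - m_pred) ** 2 / S)
        k = P_pred / S                                       # Kalman gain
        m, P = m_pred + k * (yt - m_pred), (1 - k) * P_pred  # update
    return ll

def bootstrap_pf_loglik(y, a, var_v, var_u, m0, P0, n_particles, rng):
    """Log of the product over t of the average unnormalized weight, cf. (A28)-(A29)."""
    x = m0 + np.sqrt(P0) * rng.standard_normal(n_particles)
    ll = 0.0
    for yt in y:
        x = a * x + np.sqrt(var_v) * rng.standard_normal(n_particles)  # propagate
        logw = -0.5 * (np.log(2 * np.pi * var_u) + (yt - x) ** 2 / var_u)
        c = logw.max()                                       # stabilize the exponentials
        w = np.exp(logw - c)
        ll += c + np.log(w.mean())                           # log of the average weight
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        x = x[idx]                                           # multinomial resampling
    return ll

# Simulate a short series from the toy model (hypothetical parameter values)
rng = np.random.default_rng(1)
a, var_v, var_u, m0, P0, T = 0.8, 1.0, 1.0, 0.0, 1.0, 50
x, y = m0 + np.sqrt(P0) * rng.standard_normal(), []
for _ in range(T):
    x = a * x + np.sqrt(var_v) * rng.standard_normal()
    y.append(x + np.sqrt(var_u) * rng.standard_normal())
y = np.array(y)

exact = kalman_loglik(y, a, var_v, var_u, m0, P0)
approx = {n: bootstrap_pf_loglik(y, a, var_v, var_u, m0, P0, n, rng)
          for n in (100, 10_000)}
print(exact, approx)
```

With the larger particle count, the approximation should lie close to the exact value, while the smaller count typically shows more Monte Carlo noise.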

**Proof of Lemma 5.** We apply Lemma A.5 for $t = T$, setting the test function to $\gamma_T(x_{1:T}) := U_T(x_{1:T}; \theta)$, cf. (33). By Conjecture 2 we have that $U_T(x_{1:T}; \theta) \in L^2[\mathbb{R}^{p \times T}, p_\theta(x_{1:T} \mid y_{1:T})]$, which satisfies the condition, and Lemma A.5 applies.

**Proof of Lemma 6.** We apply Lemma A.4 to the functions $U_T(x_{1:T}; \theta)$, $V_T(x_{1:T}; \theta)$ and the outer product $U_T(x_{1:T}; \theta)U_T(x_{1:T}; \theta)'$ for $\theta \in \Theta$. First, Conjecture 2 implies that $U_T(x_{1:T}; \theta) \in L^1[\mathbb{R}^{p \times T}, p_\theta(x_{1:T} \mid y_{1:T})]$, such that by setting the test function to $\gamma_T(x_{1:T}) := U_T(x_{1:T}; \theta)$, Lemma A.4 gives us that,

$$\mathbb{E}_\theta^N\left[U_T(x_{1:T}; \theta) \mid y_{1:T}\right] \stackrel{P}{\rightarrow} \mathbb{E}_\theta\left[U_T(x_{1:T}; \theta) \mid y_{1:T}\right], \tag{A30}$$

as $N \to \infty$. Second, Conjecture 2 states $V_T(x_{1:T}; \theta) \in L^1[\mathbb{R}^{p \times T}, p_\theta(x_{1:T} \mid y_{1:T})]$, such that by setting the test function to $\gamma_T(x_{1:T}) := V_T(x_{1:T}; \theta)$, Lemma A.4 gives us that,

$$\mathbb{E}_\theta^N\left[V_T(x_{1:T}; \theta) \mid y_{1:T}\right] \stackrel{P}{\rightarrow} \mathbb{E}_\theta\left[V_T(x_{1:T}; \theta) \mid y_{1:T}\right], \tag{A31}$$

as *N* → ∞. Third, we note that by the Cauchy-Schwarz inequality it holds that,

$$\left\| U_T(x_{1:T}; \theta) U_T(x_{1:T}; \theta)' \right\| \leq \left\| U_T(x_{1:T}; \theta) \right\| \left\| U_T(x_{1:T}; \theta)' \right\| = \left\| U_T(x_{1:T}; \theta) \right\|^2, \tag{A32}$$

such that, by Conjecture 2, we have that

$$\int \left\| U_T(x_{1:T}; \theta) U_T(x_{1:T}; \theta)' \right\| p_\theta(x_{1:T} \mid y_{1:T})\, \mathrm{d}x_{1:T} \leq \int \left\| U_T(x_{1:T}; \theta) \right\|^2 p_\theta(x_{1:T} \mid y_{1:T})\, \mathrm{d}x_{1:T} < \infty. \tag{A33}$$

Thus, by setting the test function to $\gamma_T(x_{1:T}) := U_T(x_{1:T}; \theta)U_T(x_{1:T}; \theta)'$, Lemma A.4 gives us that,

$$\mathbb{E}_\theta^N\left[U_T(x_{1:T}; \theta) U_T(x_{1:T}; \theta)' \mid y_{1:T}\right] \stackrel{P}{\rightarrow} \mathbb{E}_\theta\left[U_T(x_{1:T}; \theta) U_T(x_{1:T}; \theta)' \mid y_{1:T}\right], \tag{A34}$$

as *N* → ∞. Now, by (37), (39), (40), (63), (66), and (67), we have that (A30)–(A34) correspond to,

$$\bar{S}_T^N(\theta) \stackrel{P}{\rightarrow} S_T(\theta) \tag{A35}$$

$$\bar{G}_T^N(\theta) \stackrel{P}{\rightarrow} G_T(\theta) \tag{A36}$$

$$\bar{K}_T^N(\theta) \stackrel{P}{\rightarrow} K_T(\theta), \tag{A37}$$

as *N* → ∞, respectively, such that we get by the continuous mapping theorem that,

$$\bar{I}_T^N(\theta) = \bar{S}_T^N(\theta)\bar{S}_T^N(\theta)' - \bar{G}_T^N(\theta) - \bar{K}_T^N(\theta) \stackrel{P}{\rightarrow} S_T(\theta)S_T(\theta)' - G_T(\theta) - K_T(\theta) = I_T(\theta), \tag{A38}$$

as *N* → ∞, which completes the proof of Lemma 6.
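The assembly in (A38) can be illustrated with weighted draws. The sketch below is not the SSR model: it assumes a hypothetical scalar toy model, $x \sim N(\theta, 1)$ and $y \mid x \sim N(x, 1)$, for which the complete-data score is $U = x - \theta$, the complete-data Hessian is $V = -1$, and the observed information equals $1/2$ because $y \sim N(\theta, 2)$. The weighted averages stand in for the particle approximations in (A30), (A31) and (A34). The function name `louis_information` and all values are illustrative assumptions.

```python
import numpy as np

def louis_information(U, V, w):
    """Assemble S and S S' - G - K from weighted draws, cf. (A35)-(A38)."""
    w = w / w.sum()                              # normalize the weights
    S = np.einsum("i,ij->j", w, U)               # weighted mean of scores
    G = np.einsum("i,ijk->jk", w, V)             # weighted mean of Hessians
    K = np.einsum("i,ij,ik->jk", w, U, U)        # weighted mean of outer products
    return S, np.outer(S, S) - G - K

rng = np.random.default_rng(2)
theta, y, n = 0.0, 1.0, 200_000
x = theta + rng.standard_normal(n)               # draws from p(x; theta), toy assumption
w = np.exp(-0.5 * (y - x) ** 2)                  # unnormalized weights p(y | x)
U = (x - theta).reshape(n, 1)                    # complete-data score, one row per draw
V = -np.ones((n, 1, 1))                          # complete-data Hessian, one matrix per draw

S, info = louis_information(U, V, w)
print(S, info)                                   # both should be near 0.5 in this toy model
```

In this toy model the score is $(y - \theta)/2 = 1/2$, so both outputs can be compared against known values, which is a convenient sanity check before applying the same assembly to particle output.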

#### **References**


© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
