#### *2.2. INS Model*

In this paper, the mechanization is conducted in the *e*-frame (earth-centered earth-fixed frame) to ease the integration of the INS states with the GNSS observables. The dynamic equation of the INS can then be described as

$$
\begin{pmatrix}
\dot{\mathbf{x}}_{\text{INS}}^{e} \\
\dot{\mathbf{v}}_{\text{INS}}^{e} \\
\dot{\mathbf{C}}_{b}^{e}
\end{pmatrix} = \begin{pmatrix}
\mathbf{v}_{\text{INS}}^{e} \\
\mathbf{C}_{b}^{e}\mathbf{f}^{b} - 2\boldsymbol{\omega}_{ie}^{e} \times \mathbf{v}_{\text{INS}}^{e} + \mathbf{g}^{e} \\
\mathbf{C}_{b}^{e}\left[\boldsymbol{\omega}_{eb}^{b} \times\right]
\end{pmatrix} \tag{6}
$$

where $\mathbf{x}_{\text{INS}}^{e}$ is the position vector in the *e*-frame; $\mathbf{v}_{\text{INS}}^{e}$ is the velocity vector in the *e*-frame; $\mathbf{C}_{b}^{e}$ represents the rotation matrix from the *b*-frame (body frame) to the *e*-frame; $\mathbf{f}^{b}$ is the specific force vector generated by the accelerometers in the *b*-frame; $\boldsymbol{\omega}_{ie}^{e}$ is the earth rotation vector of the *e*-frame with respect to the *i*-frame (inertial frame), projected to the *e*-frame; $\mathbf{g}^{e}$ denotes the local gravity vector in the *e*-frame; $\boldsymbol{\omega}_{eb}^{b}$ denotes the rotation rate vector of the *b*-frame with respect to the *e*-frame, projected to the *b*-frame; and $[\cdot\times]$ denotes the skew-symmetric matrix.
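To make the mechanization concrete, the following minimal Python sketch propagates Equation (6) over one IMU interval with simple Euler integration; the function and variable names are ours for illustration, and the gravity vector is assumed to be supplied by the caller rather than computed from a gravity model.

```python
import numpy as np

OMEGA_IE = 7.292115e-5  # earth rotation rate (rad/s)

def skew(v):
    """Skew-symmetric matrix [v x] of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def ins_step(x_e, v_e, C_be, f_b, w_ib_b, g_e, dt):
    """One Euler step of the e-frame mechanization of Equation (6)."""
    w_ie_e = np.array([0.0, 0.0, OMEGA_IE])    # earth rotation in the e-frame
    w_eb_b = w_ib_b - C_be.T @ w_ie_e          # b-frame rate w.r.t. the e-frame
    v_dot = C_be @ f_b - 2.0 * np.cross(w_ie_e, v_e) + g_e
    x_e = x_e + v_e * dt
    v_e = v_e + v_dot * dt
    C_be = C_be + C_be @ skew(w_eb_b) * dt
    u, _, vt = np.linalg.svd(C_be)             # re-orthonormalize the DCM
    return x_e, v_e, u @ vt
```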

Using the phi-angle error model, the INS error model can be written as [42]

$$
\begin{pmatrix}
\delta\dot{\mathbf{x}}_{\text{INS}}^{e} \\
\delta\dot{\mathbf{v}}_{\text{INS}}^{e} \\
\dot{\boldsymbol{\phi}}
\end{pmatrix} = \begin{pmatrix}
\delta\mathbf{v}_{\text{INS}}^{e} \\
\left(\mathbf{C}_{b}^{e}\mathbf{f}^{b}\right)\times\boldsymbol{\phi} + \mathbf{C}_{b}^{e}\delta\mathbf{f}^{b} - 2\boldsymbol{\omega}_{ie}^{e}\times\delta\mathbf{v}_{\text{INS}}^{e} + \delta\mathbf{g}^{e} \\
-\boldsymbol{\omega}_{ie}^{e}\times\boldsymbol{\phi} - \mathbf{C}_{b}^{e}\delta\boldsymbol{\omega}_{ib}^{b}
\end{pmatrix} \tag{7}
$$

in which $\boldsymbol{\phi}$ indicates the attitude correction vector; $\delta\mathbf{g}^{e}$ represents the gravity error in the *e*-frame; and $\delta\mathbf{f}^{b}$ and $\delta\boldsymbol{\omega}_{ib}^{b}$ are the sensor errors of the accelerometer and gyroscope, respectively. Bias and scale factor errors along with white noise can be used to model the sensor errors [42], which can be expressed as

$$
\begin{cases}
\delta\mathbf{f}^{b} = \mathbf{B}_{a} + \operatorname{diag}\left(\mathbf{f}^{b}\right)\mathbf{S}_{a} + \mathbf{w}_{v} \\
\delta\boldsymbol{\omega}_{ib}^{b} = \mathbf{B}_{g} + \operatorname{diag}\left(\boldsymbol{\omega}_{ib}^{b}\right)\mathbf{S}_{g} + \mathbf{w}_{\phi}
\end{cases} \tag{8}
$$

in which $\mathbf{B}_{a}$ and $\mathbf{S}_{a}$ indicate the bias and scale factor errors of the accelerometer, respectively; $\operatorname{diag}(\cdot)$ denotes the diagonal matrix; $\mathbf{B}_{g}$ and $\mathbf{S}_{g}$ indicate the bias and scale factor errors of the gyroscope, respectively; and $\mathbf{w}_{v}$ and $\mathbf{w}_{\phi}$ indicate the corresponding random white noise. Bias and scale factor errors can be modeled as first-order Gauss–Markov processes and expressed as [42]

$$
\begin{cases}
\begin{pmatrix} \dot{\mathbf{B}}_{a} \\ \dot{\mathbf{B}}_{g} \end{pmatrix} = \begin{pmatrix} -\dfrac{1}{\tau_{ba}}\mathbf{B}_{a} \\ -\dfrac{1}{\tau_{bg}}\mathbf{B}_{g} \end{pmatrix} + \begin{pmatrix} \mathbf{w}_{ba} \\ \mathbf{w}_{bg} \end{pmatrix} \\
\begin{pmatrix} \dot{\mathbf{S}}_{a} \\ \dot{\mathbf{S}}_{g} \end{pmatrix} = \begin{pmatrix} -\dfrac{1}{\tau_{sa}}\mathbf{S}_{a} \\ -\dfrac{1}{\tau_{sg}}\mathbf{S}_{g} \end{pmatrix} + \begin{pmatrix} \mathbf{w}_{sa} \\ \mathbf{w}_{sg} \end{pmatrix}
\end{cases} \tag{9}
$$

where $\tau_{(\cdot)}$ and $\mathbf{w}_{(\cdot)}$ (with subscript $ba$, $bg$, $sa$, or $sg$) denote the corresponding correlation time and driving white noise, respectively.
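For illustration, a minimal Python sketch of Equations (8) and (9) is given below: the accelerometer error is synthesized from bias, scale factor, and white noise, and the bias is propagated with the exact discrete form of a first-order Gauss–Markov process. The correlation time and noise magnitudes are placeholders, not the parameters used in this work.

```python
import numpy as np

def accel_error(f_b, B_a, S_a, sigma_v):
    """Accelerometer sensor error per Equation (8)."""
    w_v = sigma_v * np.random.randn(3)          # white noise
    return B_a + np.diag(f_b) @ S_a + w_v

def gauss_markov_step(x, tau, q_c, dt):
    """Exact discrete propagation of a first-order Gauss-Markov state (Eq. 9).

    tau: correlation time (s); q_c: PSD of the driving white noise.
    """
    phi = np.exp(-dt / tau)
    var = q_c * tau / 2.0 * (1.0 - phi**2)      # discrete process-noise variance
    return phi * x + np.sqrt(var) * np.random.randn(*x.shape)

# illustrative usage at a 200 Hz IMU rate
B_a = np.zeros(3)
B_a = gauss_markov_step(B_a, tau=3600.0, q_c=1e-10, dt=0.005)
```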

Finally, the INS error state can be modeled as

$$\mathbf{x}_{\text{INS}} = \begin{pmatrix} \delta\mathbf{x}_{\text{INS}}^{e} & \delta\mathbf{v}_{\text{INS}}^{e} & \boldsymbol{\phi} & \mathbf{B}_{a} & \mathbf{B}_{g} & \mathbf{S}_{a} & \mathbf{S}_{g} \end{pmatrix}^{\mathrm{T}} \tag{10}$$

#### *2.3. PPP-RTK/INS Tightly Coupled Integration Model*

In the error states of PPP-RTK and INS, $\mathbf{x}_{\text{GNSS}}^{e}$ and $\mathbf{x}_{\text{INS}}^{e}$ denote the positions of the GNSS receiver antenna reference point (ARP) and the IMU center, respectively. They do not represent the same position, and their spatial relationship in the *e*-frame can be expressed as [42]

$$\mathbf{x}_{\text{GNSS}}^{e} = \mathbf{x}_{\text{INS}}^{e} + \mathbf{C}_{b}^{e}\mathbf{l}^{b} \tag{11}$$

in which $\mathbf{l}^{b}$ is the lever-arm correction vector in the *b*-frame. As for the approximate coordinates $\breve{\mathbf{x}}_{\text{GNSS}}^{e}$ and $\breve{\mathbf{x}}_{\text{INS}}^{e}$, their relationship can be described as

$$
\breve{\mathbf{x}}_{\text{GNSS}}^{e} = \breve{\mathbf{x}}_{\text{INS}}^{e} + \breve{\mathbf{C}}_{b}^{e}\mathbf{l}^{b} \tag{12}
$$

where $\breve{\mathbf{C}}_{b}^{e}$ is the approximation of $\mathbf{C}_{b}^{e}$ and satisfies

$$
\breve{\mathbf{C}}_{b}^{e} = \left(\mathbf{I} - \boldsymbol{\phi}\times\right)\mathbf{C}_{b}^{e} \tag{13}
$$

Then, the following relationship between $\delta\mathbf{x}_{\text{GNSS}}^{e}$ and $\delta\mathbf{x}_{\text{INS}}^{e}$ can be derived from Equations (11)–(13):

$$
\delta\mathbf{x}_{\text{GNSS}}^{e} = \delta\mathbf{x}_{\text{INS}}^{e} + \mathbf{C}_{b}^{e}\mathbf{l}^{b}\times\boldsymbol{\phi} \tag{14}
$$
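As a small illustration, the lever-arm transfer of Equation (11) and the corresponding error mapping of Equation (14) can be coded directly; the lever-arm value below is a placeholder for the manually measured vector, not the one used in our tests.

```python
import numpy as np

def arp_position(x_ins_e, C_be, l_b):
    """GNSS ARP position from the IMU center in the e-frame (Eq. 11)."""
    return x_ins_e + C_be @ l_b

def arp_position_error(dx_ins_e, C_be, l_b, phi):
    """Error mapping of Eq. (14): the attitude error couples through the lever arm."""
    return dx_ins_e + np.cross(C_be @ l_b, phi)

l_b = np.array([0.25, 0.10, -0.80])  # illustrative lever arm in the b-frame (m)
```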

The state error in the integrated navigation is defined as the observation minus the true value, whereas the state error in the GNSS is defined as the true value minus the observation. Thus, the signs of $\delta\mathbf{x}_{\text{GNSS}}^{e}$ and $\delta\mathbf{x}_{\text{INS}}^{e}$ are opposite. After adding a minus sign to Equation (14) and substituting it into Equation (2), the observation equation of the PPP-RTK/INS can be further expressed as

$$
\begin{aligned}
\Delta P_{r,f}^{s} &= -\mathbf{h}_{r}^{s}\delta\mathbf{x}_{\text{INS}}^{e} - \mathbf{h}_{r}^{s}\mathbf{C}_{b}^{e}\mathbf{l}^{b}\times\boldsymbol{\phi} + t_{r,\text{sys}} + a_{r}^{s}\delta T_{w} + b_{r,f} + \varepsilon_{P} \\
&\quad + \frac{40.3}{f^{2}}\gamma_{r}^{s}\left(a_{0} + a_{1}dL + a_{2}dB + a_{3}dL^{2} + a_{4}dB^{2} + r_{r}^{s}\right) \\
\Delta\Phi_{r,f}^{s} &= -\mathbf{h}_{r}^{s}\delta\mathbf{x}_{\text{INS}}^{e} - \mathbf{h}_{r}^{s}\mathbf{C}_{b}^{e}\mathbf{l}^{b}\times\boldsymbol{\phi} + t_{r,\text{sys}} + a_{r}^{s}\delta T_{w} + \lambda N_{r,f}^{s} + \varepsilon_{\Phi} \\
&\quad - \frac{40.3}{f^{2}}\gamma_{r}^{s}\left(a_{0} + a_{1}dL + a_{2}dB + a_{3}dL^{2} + a_{4}dB^{2} + r_{r}^{s}\right) \\
\tilde{I}_{r}^{s} &= a_{0} + a_{1}dL + a_{2}dB + a_{3}dL^{2} + a_{4}dB^{2} + r_{r}^{s} + \varepsilon_{\tilde{I}_{r}^{s}}
\end{aligned} \tag{15}
$$

Combining the state vector $\mathbf{x}_{\text{PPP-RTK}}$ in Equation (4) and $\mathbf{x}_{\text{INS}}$ in Equation (10), the state vector of the PPP-RTK/INS can be described as

$$\mathbf{x} = \begin{pmatrix} \delta\mathbf{x}_{\text{INS}}^{e} & \delta\mathbf{v}_{\text{INS}}^{e} & \boldsymbol{\phi} & \mathbf{B} & \mathbf{S} & t_{r} & \delta T_{w} & b_{r} & N_{r} & a_{r} & r_{r} \end{pmatrix}^{\mathrm{T}} \tag{16}$$
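For bookkeeping, the INS part of the state vector in Equation (16) can be indexed as below, assuming three-axis bias and scale-factor states; this gives 21 INS error states, consistent with the $2\times 21$ zero block in the visual measurement Jacobian of Section 2.4. The GNSS states ($t_{r}$, $\delta T_{w}$, $b_{r}$, $N_{r}$, $a_{r}$, $r_{r}$) are appended after them, with sizes depending on the numbers of constellations, frequencies, and satellites; the slice names below are ours.

```python
# slice layout of the 21 INS error states in Equation (16)
INS_STATES = {
    "dx":  slice(0, 3),    # position error
    "dv":  slice(3, 6),    # velocity error
    "phi": slice(6, 9),    # attitude error
    "Ba":  slice(9, 12),   # accelerometer bias
    "Bg":  slice(12, 15),  # gyroscope bias
    "Sa":  slice(15, 18),  # accelerometer scale factor
    "Sg":  slice(18, 21),  # gyroscope scale factor
}
```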

#### *2.4. INS/Vision Tightly Coupled Integration Model*

By denoting the error state of the camera for the *i*th image as $\left(\delta\mathbf{p}_{c_i}^{e}\;\; \boldsymbol{\phi}_{c_i}\right)$, the error state vector of the INS/vision tightly coupled integration with the MSCKF at the time when the *k*th image is captured is expressed as

$$\mathbf{x}_{k,c} = \begin{pmatrix} \mathbf{x}_{\text{INS}} & \delta\mathbf{p}_{c_j}^{e} & \boldsymbol{\phi}_{c_j} & \cdots & \delta\mathbf{p}_{c_k}^{e} & \boldsymbol{\phi}_{c_k} \end{pmatrix}^{\mathrm{T}} \tag{17}$$

in which $\delta\mathbf{p}_{c_i}^{e}$ and $\boldsymbol{\phi}_{c_i}$ ($i = j, j+1, \ldots, k$) indicate the error states of the camera position and attitude for epoch $i$. The above error state vector is augmented whenever a new camera image is introduced.

Visual measurements of the same feature point from multiple camera views are used to construct geometric constraints. At the time of taking the *i*th ($i < k+1$) image, the transformation of the static feature point $P_k$ into the camera frame can be expressed as

$$\mathbf{p}_{P_k,i}^{c_i} = \mathbf{R}_{e}^{c_i}\left(\mathbf{p}_{P_k}^{e} - \mathbf{p}_{c_i}^{e}\right) \tag{18}$$

in which $\mathbf{p}_{P_k,i}^{c_i} = \left(x_{P_k,i}^{c_i}\;\; y_{P_k,i}^{c_i}\;\; z_{P_k,i}^{c_i}\right)^{\mathrm{T}}$ indicates the position of $P_k$ in the camera frame; $\mathbf{R}_{e}^{c_i}$ and $\mathbf{p}_{c_i}^{e}$ are the attitude rotation matrix and position vector of the camera against the global frame (*e*-frame), respectively; and $\mathbf{p}_{P_k}^{e}$ is the estimated position of $P_k$ in the *e*-frame, which can be calculated by triangulation. By differentiating Equation (18), the error equation of the camera state can be obtained as follows:

$$\delta\mathbf{p}_{P_k,i}^{c_i} = -\mathbf{R}_{e}^{c_i}\left[\left(\mathbf{p}_{P_k}^{e} - \mathbf{p}_{c_i}^{e}\right)\times\right]\boldsymbol{\phi}_{c_i} - \mathbf{R}_{e}^{c_i}\delta\mathbf{p}_{c_i}^{e} + \mathbf{R}_{e}^{c_i}\delta\mathbf{p}_{P_k}^{e} \tag{19}$$

in which $\delta\mathbf{p}_{P_k}^{e}$ indicates the error of the approximate position of the feature point.
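Since $\mathbf{p}_{P_k}^{e}$ is obtained by triangulation, a minimal DLT-style linear least-squares sketch over $n \geq 2$ views is shown below; it illustrates the idea and is not necessarily the exact estimator used in this paper.

```python
import numpy as np

def triangulate(K, R_ec_list, p_ce_list, uv_list):
    """Linearly triangulate a static feature in the e-frame from n >= 2 views.

    R_ec_list: rotations R_e^{c_i}; p_ce_list: camera positions p_{c_i}^e;
    uv_list: observed pixel coordinates (u, v) in each view.
    """
    A = []
    for R_ec, p_ce, (u, v) in zip(R_ec_list, p_ce_list, uv_list):
        P = K @ np.hstack([R_ec, (-R_ec @ p_ce).reshape(3, 1)])  # 3x4 projection
        A.append(u * P[2] - P[0])   # u*(P3.X) - P1.X = 0
        A.append(v * P[2] - P[1])   # v*(P3.X) - P2.X = 0
    _, _, vt = np.linalg.svd(np.asarray(A))
    X = vt[-1]                      # right singular vector of smallest value
    return X[:3] / X[3]             # homogeneous -> Euclidean
```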

The camera measurement residual vector can be written as

$$\mathbf{z}_{P_k,i} = \begin{pmatrix} u_{P_k,i}^{0} - \tilde{u}_{P_k,i} \\ v_{P_k,i}^{0} - \tilde{v}_{P_k,i} \end{pmatrix}, \qquad \boldsymbol{\varepsilon}_{P_k,i} = \begin{pmatrix} \varepsilon_{\tilde{u}_{P_k,i}} \\ \varepsilon_{\tilde{v}_{P_k,i}} \end{pmatrix} \tag{20}$$

where $\left(u_{P_k,i}^{0},\, v_{P_k,i}^{0}\right)^{\mathrm{T}}$ is the estimated pixel coordinate of $P_k$ obtained by back projection, $\left(\tilde{u}_{P_k,i},\, \tilde{v}_{P_k,i}\right)^{\mathrm{T}}$ is the observation of $P_k$ in the *i*th image, and $\boldsymbol{\varepsilon}_{P_k,i}$ is the measurement noise. Then, based on the chain rule, the residual can be expressed as [27]

$$\mathbf{z}_{P_k,i} = \delta\left(u_{P_k,i},\, v_{P_k,i}\right)^{\mathrm{T}} = \frac{\partial\left(u_{P_k,i},\, v_{P_k,i}\right)^{\mathrm{T}}}{\partial\mathbf{p}_{P_k,i}^{c_i}}\,\delta\mathbf{p}_{P_k,i}^{c_i} = \mathbf{H}_{x,i}\,\mathbf{x}_{k,c} + \mathbf{H}_{f,i}\,\delta\mathbf{p}_{P_k}^{e} + \boldsymbol{\varepsilon}_{P_k,i} \tag{21}$$

where

$$
\begin{aligned}
\mathbf{H}_{x,i} &= \begin{pmatrix} \mathbf{0}_{2\times 21} & \mathbf{0}_{2\times 6} & \cdots & -\mathbf{J}\mathbf{R}_{e}^{c_i} & -\mathbf{J}\mathbf{R}_{e}^{c_i}\left[\left(\mathbf{p}_{P_k}^{e} - \mathbf{p}_{c_i}^{e}\right)\times\right] & \cdots \end{pmatrix}, \qquad \mathbf{H}_{f,i} = \mathbf{J}\mathbf{R}_{e}^{c_i} \\
\mathbf{J} &= \frac{\partial\left(u_{P_k,i},\, v_{P_k,i}\right)^{\mathrm{T}}}{\partial\mathbf{p}_{P_k,i}^{c_i}} = \begin{pmatrix} \dfrac{f_x}{z_{P_k,i}^{c_i}} & 0 & -\dfrac{f_x\, x_{P_k,i}^{c_i}}{\left(z_{P_k,i}^{c_i}\right)^{2}} \\ 0 & \dfrac{f_y}{z_{P_k,i}^{c_i}} & -\dfrac{f_y\, y_{P_k,i}^{c_i}}{\left(z_{P_k,i}^{c_i}\right)^{2}} \end{pmatrix}
\end{aligned}
$$

in which $(f_x,\, f_y)$ are the focal lengths and $\left(x_{P_k,i}^{c_i}\;\; y_{P_k,i}^{c_i}\;\; z_{P_k,i}^{c_i}\right)^{\mathrm{T}}$ is the position of the feature point in the camera frame.
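The Jacobian $\mathbf{J}$ transcribes directly to code, assuming the standard pinhole model $u = f_x x/z + c_x$, $v = f_y y/z + c_y$:

```python
import numpy as np

def projection_jacobian(p_c, fx, fy):
    """d(u, v)/dp^c for a camera-frame point p_c = (x, y, z)."""
    x, y, z = p_c
    return np.array([[fx / z, 0.0, -fx * x / z**2],
                     [0.0, fy / z, -fy * y / z**2]])
```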

Because $\delta\mathbf{p}_{P_k}^{e}$ is not a state that needs to be estimated and $\mathbf{H}_{f,i}$ is known, we can calculate the left null space $\mathbf{A}$ of $\mathbf{H}_{f,i}$, which satisfies

$$\mathbf{A}^{\mathrm{T}}\mathbf{H}_{f,i} = \mathbf{0} \tag{22}$$

Then, multiplying both sides of Equation (21) by $\mathbf{A}^{\mathrm{T}}$, the measurement model can be described as

$$\mathbf{z}_{a,i} = \mathbf{A}^{\mathrm{T}}\mathbf{z}_{P_k,i} = \mathbf{A}^{\mathrm{T}}\mathbf{H}_{x,i}\,\mathbf{x}_{k,c} + \mathbf{A}^{\mathrm{T}}\boldsymbol{\varepsilon}_{P_k,i} = \mathbf{H}_{a,x,i}\,\mathbf{x}_{k,c} + \boldsymbol{\varepsilon}_{a,i} \tag{23}$$

where $\mathbf{z}_{a,i} = \mathbf{A}^{\mathrm{T}}\mathbf{z}_{P_k,i}$, $\mathbf{H}_{a,x,i} = \mathbf{A}^{\mathrm{T}}\mathbf{H}_{x,i}$, and $\boldsymbol{\varepsilon}_{a,i} = \mathbf{A}^{\mathrm{T}}\boldsymbol{\varepsilon}_{P_k,i}$, respectively.
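A minimal sketch of this null-space projection is given below; the left null space of $\mathbf{H}_{f,i}$ is taken from its full SVD, so the feature-position error drops out of the measurement model exactly as in Equations (22) and (23).

```python
import numpy as np

def nullspace_project(z, H_x, H_f):
    """Project stacked visual residuals onto the left null space of H_f."""
    u, s, _ = np.linalg.svd(H_f, full_matrices=True)
    rank = int((s > 1e-10).sum())
    A = u[:, rank:]                  # columns spanning the left null space
    return A.T @ z, A.T @ H_x        # z_a and H_{a,x} of Equation (23)
```

The pixel coordinate of $P_k$ can be described as follows: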

$$
\begin{pmatrix} \tilde{u}_{P_k,i} \\ \tilde{v}_{P_k,i} \\ 1 \end{pmatrix} = \frac{1}{z_{P_k,i}^{c_i}}\,\mathbf{K}\,\mathbf{p}_{P_k,i}^{c_i} \tag{24}
$$

in which $\mathbf{K}$ denotes the camera intrinsic matrix.

Substituting Equation (18) into Equation (24), we have:

$$
\begin{pmatrix} \tilde{u}_{P_k,i} \\ \tilde{v}_{P_k,i} \\ 1 \end{pmatrix} = \frac{1}{z_{P_k,i}^{c_i}}\,\mathbf{K}\,\mathbf{R}_{e}^{c_i}\left(\mathbf{p}_{P_k}^{e} - \mathbf{p}_{c_i}^{e}\right) \tag{25}
$$

Thus,

$$
\begin{pmatrix} \Delta\tilde{u}_{P_k} \\ \Delta\tilde{v}_{P_k} \\ 0 \end{pmatrix} = \mathbf{K}\left[\frac{1}{z_{P_k,i+1}^{c_{i+1}}}\,\mathbf{R}_{e}^{c_{i+1}}\left(\acute{\mathbf{p}}_{P_k}^{e} - \mathbf{p}_{c_{i+1}}^{e}\right) - \frac{1}{z_{P_k,i}^{c_i}}\,\mathbf{R}_{e}^{c_i}\left(\mathbf{p}_{P_k}^{e} - \mathbf{p}_{c_i}^{e}\right)\right] \tag{26}
$$

where $\Delta\tilde{u}_{P_k} = \tilde{u}_{P_k,i+1} - \tilde{u}_{P_k,i}$ and $\Delta\tilde{v}_{P_k} = \tilde{v}_{P_k,i+1} - \tilde{v}_{P_k,i}$; $\acute{\mathbf{p}}_{P_k}^{e}$ is the position of $P_k$ for epoch $i+1$; and $\left(\tilde{u}_{P_k,i+1},\, \tilde{v}_{P_k,i+1}\right)^{\mathrm{T}}$ is the observation of $P_k$ in the $(i+1)$th image.

Considering that the time interval between consecutive images is relatively small, we assume that the pose and position of the camera show no obvious changes. Because $z_{P_k,i+1}^{c_{i+1}}$ is relatively large, we can make the approximations $\mathbf{R}_{e}^{c_{i+1}} \approx \mathbf{R}_{e}^{c_i}$ and $\frac{1}{z_{P_k,i+1}^{c_{i+1}}} \approx \frac{1}{z_{P_k,i}^{c_i}}$. For a static feature point, $\mathbf{p}_{P_k}^{e} - \acute{\mathbf{p}}_{P_k}^{e} = \mathbf{0}$; thus, we can obtain

$$z_{P_k,i+1}^{c_{i+1}}\begin{pmatrix} \Delta\tilde{u}_{P_k} \\ \Delta\tilde{v}_{P_k} \\ 0 \end{pmatrix} = -\mathbf{K}\,\mathbf{R}_{e}^{c_{i+1}}\left(\mathbf{p}_{c_{i+1}}^{e} - \mathbf{p}_{c_i}^{e}\right) \tag{27}$$

As we know, $\mathbf{p}_{c_{i+1}}^{e}$ and $\mathbf{p}_{c_i}^{e}$ can be obtained from PPP-RTK/INS. For dynamic objects on the road, $\left\|\left(\acute{\mathbf{p}}_{P_k}^{e} - \mathbf{p}_{c_{i+1}}^{e}\right) - \left(\mathbf{p}_{P_k}^{e} - \mathbf{p}_{c_i}^{e}\right)\right\| < \left\|\mathbf{p}_{c_{i+1}}^{e} - \mathbf{p}_{c_i}^{e}\right\|$, so the depth $z_{P_k,i+1}^{c_{i+1}}$ implied by Equation (27) is biased. By setting a threshold for $z_{P_k,i+1}^{c_{i+1}}$, the dynamic objects can be removed.
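A sketch of the resulting test is given below: the depth $z_{P_k,i+1}^{c_{i+1}}$ implied by Equation (27) is solved in a least-squares sense from the pixel displacement and the PPP-RTK/INS camera displacement, and features whose depth falls outside a plausible range are flagged as dynamic. The threshold values are illustrative, not those used in our tests.

```python
import numpy as np

def is_static(du, dv, K, R_ec, p_ce1, p_ce0, z_min=2.0, z_max=100.0):
    """Position-aided dynamic feature test built on Equation (27)."""
    lhs = np.array([du, dv, 0.0])           # pixel displacement vector
    rhs = -K @ R_ec @ (p_ce1 - p_ce0)       # camera displacement term
    denom = lhs @ lhs
    if denom < 1e-12:                       # feature barely moved in the image
        return True
    z = (lhs @ rhs) / denom                 # least-squares depth from Eq. (27)
    return z_min < z < z_max
```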

#### *2.5. PPP-RTK/INS/Vision Integration Model with a Cascading Filter*

The PPP-RTK/INS/vision integration model with a cascading filter is realized by feeding the output of the tightly coupled PPP-RTK/INS integration into the tightly coupled INS/vision integration. The difference between the position provided by PPP-RTK/INS and the position predicted by INS/vision constitutes the position observation. Based on Equations (14) and (23), the measurement model of the PPP-RTK/INS/vision integration can be expressed as

$$
\begin{cases}
\hat{\mathbf{x}}_{\text{GNSS}}^{e} - \mathbf{x}_{\text{GNSS}}^{e} = \delta\mathbf{x}_{\text{INS}}^{e} + \mathbf{C}_{b}^{e}\mathbf{l}^{b}\times\boldsymbol{\phi} + \boldsymbol{\varepsilon}_{x} \\
\mathbf{z}_{a,i} = \mathbf{H}_{a,x,i}\,\mathbf{x}_{k,c} + \boldsymbol{\varepsilon}_{a,i}
\end{cases} \tag{28}
$$

in which $\hat{\mathbf{x}}_{\text{GNSS}}^{e}$ is the position predicted by INS/vision, $\mathbf{x}_{\text{GNSS}}^{e}$ is the positioning result of PPP-RTK/INS, and $\boldsymbol{\varepsilon}_{x}$ is the measurement noise.

The integration model of multi-GNSS PPP-RTK/INS/vision with a cascading filter follows from the description above, and an overview of the proposed model is shown in Figure 1. First, the position obtained by the GNSS assists the navigation initialization, e.g., the IMU alignment. Then, the INS begins to provide high-rate navigation. When the INS synchronizes with the GNSS, tightly coupled integration is performed based on either PPP/INS or PPP-RTK/INS; for the latter, high-precision atmospheric corrections are applied and AR is carried out with the UPD products. Furthermore, feature points are extracted and tracked in each image. When the camera time synchronizes with the INS, the state vector is augmented and the MSCKF is adopted to calculate the relative position of the vehicle platform. The positioning results of the GNSS/INS are then integrated with the MSCKF to produce the final navigation information. There are two Kalman filters: one for PPP-RTK/INS and one for INS/vision. The positioning results of PPP-RTK/INS are fed into the INS/vision filter, which makes the whole a cascading Kalman filter. All the IMU sensor errors are fed back in time during the process.
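The data flow of Figure 1 can be summarized by the following structural sketch; the two filter objects are hypothetical interfaces introduced only to illustrate the cascading order, not the FUSING implementation.

```python
def run_cascading(imu_stream, gnss_queue, image_queue, gnss_ins, ins_vision):
    """Cascade the PPP-RTK/INS filter into the INS/vision MSCKF (Figure 1)."""
    for imu in imu_stream:
        gnss_ins.propagate(imu)                    # INS mechanization (Eq. 6)
        ins_vision.propagate(imu)
        if gnss_queue.synchronized(imu.t):         # PPP-RTK/INS tight coupling:
            gnss_ins.update(gnss_queue.pop())      # Eq. (15), with AR and
                                                   # atmospheric augmentation
            ins_vision.feed_position(gnss_ins.position())  # first row of Eq. (28)
        if image_queue.synchronized(imu.t):        # INS/vision tight coupling:
            ins_vision.augment_state()             # camera clone, Eq. (17)
            ins_vision.msckf_update(image_queue.pop())     # Eqs. (21)-(23)
        gnss_ins.feedback_imu_errors()             # sensor errors fed back in time
        ins_vision.feedback_imu_errors()
    return ins_vision
```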

**Figure 1.** The algorithm structure of PPP-RTK/INS/vision with a cascading filter.

#### **3. Experiment**

To evaluate the positioning accuracy and performance of the proposed multi-GNSS PPP-RTK/INS/vision integration model with a cascading filter in urban areas, we implemented the algorithm as an extension of the FUSING (FUSing IN Gnss) software [7,40,41]. At present, FUSING has been developed into an integrated software platform that can handle real-time multi-GNSS precise orbit determination, atmospheric delay modeling and monitoring, satellite clock estimation, and multi-sensor fusion navigation.

Two datasets were collected with the vehicle platform shown in Figure 2. One was collected in a suburban area of Wuhan City on 1 January 2020, and the other on the Second Ring Road of Wuhan City on 2 January 2020. For simplicity, according to the DOY (day of year), the two vehicle tests are denoted as T01 and T02, respectively.

**Figure 2.** GNSS/INS/vision data collection platform.

As shown in Figure 2, the raw data was collected by IMUs of two different grades. A MEMS-grade IMU was integrated with PPP-RTK and vision to evaluate the performance of the proposed model. The navigation-grade IMU is of high accuracy and was integrated with RTK to calculate the reference solution, which was regarded as the true value. Both IMUs collected data at a sampling rate of 200 Hz, and their performance parameters are listed in Table 1. A grayscale Basler acA640-90gm camera collected the raw images at a sampling rate of 10 Hz with a resolution of 659 × 494. A UBLOX-M8T receiver generated the pulse per second (PPS) signal to trigger the camera exposure and recorded the time of each pulse. GNSS data were collected by a Trimble NetR9 at a sampling rate of 1 Hz. The camera-IMU extrinsic parameters were calibrated offline using the Kalibr tool (https://github.com/ethz-asl/kalibr/ (accessed on 3 January 2020)). The lever-arm correction vector was measured manually.


**Table 1.** Performance parameters of the IMU sensors.

The test trajectories and typical scenarios of T01 and T02 are shown in the left and right panels of Figure 3, respectively. It can be seen that dataset T01 was collected under a relatively open sky, with only a few obstacles blocking the GNSS signals. T02 was collected in a GNSS-challenged environment: there were many tall buildings on both sides of the narrow road, as well as viaducts and tunnels, which could totally block the GNSS signals. The vehicle speeds of T01 and T02 were about 10 m/s and 15 m/s, respectively, as shown in Figure 4, and there were significant changes in velocity and direction. The number of visible satellites and the PDOP (position dilution of precision) with a cutoff angle of 10° are shown in Figure 5. Taking GPS as an example, the average numbers of tracked satellites for T01 and T02 were 9 and 7, respectively, and the average PDOP values were 1.8 and 10.9, respectively, which demonstrates the difference in the observation environment between T01 and T02.

**Figure 3.** The test trajectory and typical scenarios of T01 (left panel) and T02 (right panel).

**Figure 4.** Velocity of the vehicle for T01 (left panel) and T02 (right panel), respectively.

**Figure 5.** Satellite number of GPS/BDS and PDOP for T01 (left panel) and T02 (right panel), respectively.

To provide the high-precision ionospheric and tropospheric delay augmentation and the ambiguity resolution needed for PPP-RTK, measurement data from the seven reference stations distributed as in Figure 6 were also collected. They were processed to generate high-precision atmospheric delay corrections and UPD products. The average distance between the seven reference stations is 40 km, and the green trajectories in Figure 6 are those of the two tests shown in Figure 3.

**Figure 6.** Distribution of seven reference stations for generating the atmospheric and UPD products. The green lines denote the trajectories of the two experiments.

The details of the GNSS data processing strategy are presented in Table 2. The positioning performance was evaluated by the RMS (root mean square) of the position error. The reference solution was calculated by a loosely coupled RTK/INS solution with a bi-directional smoothing algorithm, in which the navigation-grade IMU and the GNSS data collected by the Trimble NetR9 were adopted. The ground truth was calculated using the commercial software GINS (http://www.whmpst.com/cn/ (accessed on 15 April 2021)). The nominal positioning accuracy of the RTK/INS loosely coupled solution provided by GINS is at the level of 2 cm horizontally and 3 cm vertically.


**Table 2.** GNSS data processing strategy of PPP and PPP-RTK.

#### **4. Results**

All results were output at a frequency of 1 Hz. In the following analysis, PPP-RTK/INS means the tightly coupled integration of PPP-RTK and INS, and PPP-RTK/INS/vision means the integration of PPP-RTK, INS, and vision with a cascading filter. Before analyzing the performance of the proposed integration system in urban areas, the effect of the position-aided dynamic feature point removal algorithm is presented in Figure 7. It can be seen that the feature points on the car (in the red box) were removed. Some points extracted from indistinct areas were also removed, while most of the distinct static feature points were retained and could be used for visual localization. Figure 8 shows the positioning results of PPP-RTK/INS/vision for T02 before and after dynamic feature point removal. The positioning accuracy was improved by dynamic feature point removal, as demonstrated in the green box in Figure 8. When PPP-RTK/INS provided stable, high-accuracy position information, the error caused by the dynamic feature points was constrained. However, the positioning performance was obviously degraded by the dynamic feature points when the GNSS signals were interfered with. Combining Figures 5 and 8, the GNSS observation conditions were poor and PPP-RTK/INS performed poorly in positioning around 362,900 s, 364,000 s, and 364,600 s; thus, PPP-RTK/INS was not able to restrain the interference of the dynamic feature points. Therefore, the position-based dynamic feature point removal algorithm improved the positioning accuracy when the GNSS signals were severely disturbed. The statistics of the positioning results show that the positioning accuracy was improved by 3 cm and 1 cm in the horizontal and vertical directions, respectively.

**Figure 7.** The effect of dynamic feature points removal.

**Figure 8.** The effects of dynamic feature points removal on positioning accuracy.

The comparison of the performance of different positioning solutions is presented in Figures 9 and 10 for T01 and T02, respectively. It can be seen in the left panels of Figure 9 that PPP-RTK converged much faster than PPP, though the ambiguities sometimes failed to be fixed. It took about 30 s for PPP-RTK to converge and more than 10 min for PPP. Moreover, PPP-RTK had higher accuracy than PPP after convergence. The inclusion of BDS made the series more stable, especially in the vertical direction. Although the contribution of the INS was rather limited in the horizontal direction, as shown in the GC-PPP-RTK/INS solution, outliers were suppressed, e.g., around 28,400 s. As for the vertical direction, the INS contributed significantly to the improvement of the positioning accuracy; it helped the GNSS to converge to and maintain a higher level of positioning accuracy. Additionally, the introduction of vision reduced the fluctuation, although overall there was no big difference. As for T02, there was no obvious convergence in Figure 10, as the observation environment was complicated and the positioning accuracy was relatively low. However, it still can be seen in the left panels of Figure 10 that PPP-RTK performed better than PPP. In order to further demonstrate the convergence and reconvergence behavior of PPP and PPP-RTK, enlarged sections of Figure 10 are shown in Figure 11, and the correspondence can be seen from the time of the week. PPP-RTK converged and reconverged much faster than PPP and achieved higher positioning accuracy, though the observational environment was challenging. Figure 10 shows that the series of G-PPP and G-PPP-RTK were interrupted many times because the GNSS signals were blocked, as reflected in the number of visible satellites in Figure 5. Both the continuity and accuracy of the positioning were improved with BDS included. However, the GNSS was still unable to provide positioning results from 362,906 s to 362,965 s because the vehicle was in a tunnel at that time and there was no GNSS signal at all. GC-PPP-RTK/INS provided continuous and more stable positioning information through the integration with the INS, but there were also obvious fluctuations and epochs with large positioning errors. The positioning errors diverged to 21.74 m, 15.82 m, and 3.82 m in the north, east, and down directions, respectively. The three-dimensional positioning error at 362,965 s was 27.16 m, which is not suitable for vehicle navigation. Furthermore, the positioning error was obviously reduced when vision was included, especially around 362,900 s, 363,600 s, and 364,000 s of the week. The cumulative errors of the INS were effectively constrained by vision. Thus, in the period from 362,906 s to 362,965 s, the maximum errors of the INS/vision were −1.43 m, −3.94 m, and 2.50 m in the north, east, and down directions, respectively. The corresponding three-dimensional error was 4.88 m, which was 0.49% of the traveled distance. It can be concluded that the integration of multi-GNSS PPP-RTK/INS/vision with a cascading filter performed best in comparison with the other four solutions.

**Figure 9.** Position difference series of T01 in north, east, and down directions, respectively.

**Figure 10.** Position difference series of T02 in north, east, and down directions, respectively.

**Figure 11.** Convergence (left panel) and reconvergence (middle panel and right panel) of T02 in north, east, and down directions, respectively.

The statistics of the position differences for T01 and T02 are presented in Tables 3 and 4 to further verify the conclusion. Additionally, the number of epochs at which position information can be obtained is denoted as A and the number of total epochs as B; the positioning availability can then be defined as A/B, which is also included in the tables. The improvement statistics are derived in comparison with the G-PPP solution. The statistics show that PPP-RTK performed better than PPP in both tests. The ambiguities could not be fixed in many epochs because of the frequent GPS signal interruptions and disturbances in T02; thus, the improvement brought by PPP-RTK for T02 was not as obvious as for T01. The horizontal and vertical positioning RMS of the GC-PPP-RTK/INS/vision solution for test T01 were 0.08 m and 0.09 m, respectively. As for T02, the horizontal and vertical RMS of the GC-PPP-RTK/INS/vision solution were 0.83 m and 0.91 m, respectively. It can be seen that GC-PPP-RTK/INS/vision achieved significant improvements compared with the other four solutions. The improvements would have been even more obvious had the interruptions of G-PPP and G-PPP-RTK been taken into account: the positioning availabilities of G-PPP and G-PPP-RTK were both 90.5%, and the availability increased to 95.1% with the inclusion of BDS, which means that more epochs with worse positioning were taken into consideration. Because of the accumulated errors of the INS, the statistics of GC-PPP-RTK/INS were worse than those of all the other solutions in the horizontal direction. The statistics of GC-PPP-RTK, GC-PPP-RTK/INS, and GC-PPP-RTK/INS/vision are shown in Table 5, in which the positioning results derived by the INS and vision alone are excluded, in order to better show the improvements brought by the INS and vision to the positioning performance of GC-PPP-RTK. The improvement statistics were derived in comparison with the GC-PPP-RTK solution. It can be seen that the INS improved the performance of GC-PPP-RTK by 31.4% and 37.1% in the horizontal and vertical directions, respectively. Eventually, the inclusion of vision increased the improvements to 37.1% and 42.2% in the horizontal and vertical directions, respectively.
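For reference, the two metrics used in Tables 3–5 can be computed as below, with epochs lacking a position fix marked as NaN; this is a sketch of the metric definitions, not the evaluation code used for this paper.

```python
import numpy as np

def rms(err):
    """RMS of an error series; NaN marks epochs without a position fix."""
    e = err[~np.isnan(err)]
    return float(np.sqrt(np.mean(e**2)))

def availability(err):
    """Positioning availability A/B: epochs with output over total epochs."""
    return float(np.count_nonzero(~np.isnan(err)) / err.size)
```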


**Table 3.** RMS of the positioning error of T01 in horizontal and vertical directions, respectively.

**Table 4.** RMS of the positioning error of T02 in horizontal and vertical directions, respectively.


**Table 5.** RMS of the position difference of T02, with some epochs excluded, in horizontal and vertical directions, respectively.


Position, velocity, and attitude are of great importance for vehicle navigation in urban areas. The velocity error series of GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision for T01 and T02 are shown in Figures 12 and 13, respectively. As dataset T01 was collected in a relatively open-sky environment, the error series was very stable. There was no obvious difference between GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision. Because T02 was collected in a GNSS-challenged environment, there were obvious divergences in the velocity estimation around 362,900 s, 363,600 s, and 364,000 s. The inclusion of vision weakened the impact and improved the accuracy of the velocity estimation in three directions. It can be seen from the statistics in Tables 6 and 7 that vision could not bring about obvious improvements in an open-sky environment but greatly improved the velocity estimation accuracy in a GNSS-challenged environment. The error of velocity estimation in T02 was reduced by the inclusion of vision from 0.12 m/s, 0.07 m/s, and 0.07 m/s to 0.03 m/s, 0.05 m/s, and 0.05 m/s in the north, east, and down directions, respectively. The improvement in all three directions was more than 20%.

**Table 6.** RMS of the velocity error of T01 in north, east, and down directions, respectively.


**Table 7.** RMS of the velocity error of T02 in north, east, and down directions, respectively.


**Figure 12.** Velocity error series of GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision for T01.

**Figure 13.** Velocity error series of GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision for T02.

The attitude error series of GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision for T01 and T02 are shown in Figures 14 and 15, respectively. It can be seen that there was no obvious difference between GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision in the roll and pitch angles in T01. As for T02, the errors of the pitch and yaw angles of GC-PPP-RTK/INS accumulated when GNSS signals were blocked or interfered with. Figure 15 shows that vision helped to constrain the error divergence around 363,000 s when the GNSS signals were blocked, but there was no obvious difference in the roll and pitch angles in the other parts of T02. However, the estimation accuracy of the yaw angle was significantly improved with vision aiding in both tests. The statistics of the attitude error are listed in Tables 8 and 9. The inclusion of vision reduced the error of the yaw angle from 0.24° to 0.11° and from 0.39° to 0.26° for T01 and T02, respectively. The improvement rates were more than 30% in both tests.

**Figure 14.** Attitude error series of GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision for T01.

**Figure 15.** Attitude error series of GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision for T02.


**Table 8.** RMS of the attitude error of T01 for GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision.

**Table 9.** RMS of the attitude error of T02 for GC-PPP-RTK/INS and GC-PPP-RTK/INS/vision.


#### **5. Discussion**

GC-PPP-RTK/INS/vision integration with a cascading filter can provide continuous positioning information with high precision. The improvements brought about by vision are more significant in challenging environments. Vision reduces the error divergence of PPP-RTK/INS when the GNSS signals are blocked: in our test, the positioning error of PPP-RTK/INS reached 27.16 m after the GNSS signals were lost for 60 s, and the inclusion of vision reduced the positioning error to 4.88 m. Vision also helps to improve the estimation accuracy of velocity and attitude. Although the improvements are significant, the positioning accuracy still needs to be improved for vehicle navigation.

Dynamic objects can seriously affect the positioning performance of vision, so it is important to weaken the influence of dynamic objects. The dynamic object removal algorithm proposed in this paper can remove most fast-moving objects and can improve the positioning accuracy of vision. However, it is difficult to deal with slow-moving objects, which is worth further research.

Only GPS and BDS-2 were used in our test so the inclusion of BDS-3 and other systems will be the subject of further research. The integration model of PPP-RTK/INS/vision proposed in this paper is realized by a cascading filter, which can still work when one subsystem is seriously disturbed. The tightly coupled integration of PPP-RTK/INS/vision is another integration method that is worth studying. Because the update frequency of vision is higher, the tightly coupled integration of PPP-RTK/INS/vision may face a heavier computing burden. Therefore, there are still many problems worthy of study in the integration of PPP-RTK/INS/vision.

#### **6. Conclusions**

To improve the position, velocity, and attitude estimation performance in urban areas for vehicle navigation, a multi-GNSS PPP-RTK/INS/vision integration model with a cascading filter was developed and validated using two vehicular tests, T01 and T02, in urban areas. T01 was conducted in a suburban area of Wuhan City and T02 on the Second Ring Road of Wuhan City; the T02 test can be regarded as a typical GNSS-challenged environment. To obtain the atmospheric corrections and UPD products for PPP-RTK, observations from seven reference stations were also collected and processed.

A dynamic object removal model was also proposed and validated with T02. This position-based dynamic object removal model works well in a GNSS-challenged environment and improves the positioning performance of multi-GNSS PPP-RTK/INS/vision.

PPP-RTK achieved centimeter-level positioning in the horizontal direction and decimeter-level positioning in the vertical direction under a relatively open-sky environment such as T01. The performance in the vertical direction was obviously improved when BDS was included with respect to G-PPP-RTK. Moreover, it took only about 30 s for PPP-RTK to converge owing to the atmospheric augmentation and ambiguity resolution. However, incorrect ambiguity resolution remained, and the position performance became significantly worse in such cases. The introduction of the INS weakened the influence of the incorrect ambiguity resolution and improved the positioning accuracy. The positioning errors of GC-PPP-RTK/INS were 0.08 m and 0.09 m, improvements of 11.1% and 59.1% in the horizontal and vertical directions, respectively, in comparison with GC-PPP-RTK.

The performance of PPP-RTK degraded rapidly when the GNSS observation environment became complicated and challenging, as in T02. G-PPP-RTK could only achieve meter-level positioning in a GNSS-challenged environment. Compared with G-PPP-RTK, the positioning availability was improved from 90.5% to 95.1% for GC-PPP-RTK. The GC-PPP-RTK/INS solution, in comparison with the GC-PPP-RTK solution, contributed significantly to the improvement of the positioning accuracy in the horizontal and vertical directions as well as the positioning availability, which increased to 100%. The positioning errors of GC-PPP-RTK/INS were 0.48 m and 0.73 m in the horizontal and vertical directions, respectively, after excluding the positioning results derived by the INS alone. The improvements were 31.4% and 37.1% in the horizontal and vertical directions, respectively, in comparison with GC-PPP-RTK.

The two vehicle tests showed that GC-PPP-RTK/INS could realize high-precision continuous positioning in a relatively open-sky environment, but the positioning errors diverged to more than 20 m in a GNSS-challenged environment. Thus, other sensors such as vision were needed to help restrict the error divergence. For T01, vision did not improve the positioning accuracy statistically but further reduced the fluctuation slightly in the vertical direction. The position RMS of the GC-PPP-RTK/INS/vision integration were 0.08 m and 0.09 m in the horizontal and vertical directions, respectively, which can fully meet the demands of vehicle navigation in urban areas. However, the introduction of vision significantly improved the positioning performance in both the horizontal and vertical directions for T02. The RMS of GC-PPP-RTK/INS/vision reached 0.83 m and 0.91 m in the horizontal and vertical directions, respectively, improving the positioning accuracy by 54.9% and 7.1% compared with GC-PPP-RTK/INS. Additionally, the velocity and attitude estimation performance were also analyzed in this paper. The inclusion of vision improved the velocity performance by more than 25% in the north, east, and down directions in a GNSS-challenged environment. As for attitude, there was no obvious difference with vision in the roll and pitch angles, but GC-PPP-RTK/INS/vision performed much better in the estimation of the yaw angle; the improvements brought about by vision were more than 30% in both tests.

The results show that GC-PPP-RTK/INS/vision integration with a cascading filter performs best in the position, velocity, and attitude estimations compared with the other solutions. Multi-GNSS, INS, and vision can play their respective roles and achieve complementary advantages in vehicle navigation in urban areas. However, navigation performance in real-time still deserves further study.

**Author Contributions:** S.G., C.D. and W.F. carried out the research and the experiment. F.M. helped to solve the atmosphere augmentation products and UPD products; S.G. and C.D. analyzed the results and drafted the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China, grant number 42174029.

**Data Availability Statement:** The GNSS observation, region atmosphere augmentation products, UPD products, IMU, and vision measurement data can be accessed from sftp://59.172.178.34:10016 (accessed on 10 August 2022), username: tmp-user, password: Multi-sensor@FUSING. For example, input the command "sftp -P 10016 tmp-user@59.172.178.34".

**Conflicts of Interest:** The authors declare no conflict of interest.
