#### *3.2. Sensor Model*

As previously stated, a single camera is the main sensor used for map building in our SLAM system. Monocular cameras are common components of off-the-shelf UAVs. Their most beneficial properties in the field of robotics are their small size, low cost, and low power consumption. Moreover, the update rates and resolutions of these cameras are sufficient to track environmental changes during motion at velocities of up to tens of meters per second. For these reasons, monocular cameras are widely used for localization, structure from motion, mapping, SLAM, etc. However, single-camera usage also has consequential drawbacks. The main disadvantage, which has to be addressed when analyzing the monocular sensor model, is the lack of depth information. In other words, the perspective projection that transforms 3D real-world points into 2D pixel coordinates is lossy and complicates the pose calculation of extracted real-world features. As landmark 3D positions cannot be computed straightforwardly, a strategy for resolving this difficulty has to be adopted. A given algorithm can either initialize a landmark immediately after its first observation, with the uncertainty of the measured depth set significantly larger than that of the other coordinates, or delay the initialization until observations from different perspectives allow a temporal-stereo analysis that recovers the landmark's full 3D position. The first approach is called undelayed initialization, while the latter is commonly referred to as delayed landmark initialization.

The adopted sensor model, together with the initialization strategy, starts with the registration of a single camera frame. The camera's intrinsic parameters are known and are denoted as $\mathbf{K}\_{intr}$:

$$\mathbf{K}\_{intr} = \begin{bmatrix} f\_x & s & u\_0 \\ 0 & f\_y & v\_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{3}$$

where $f\_x$ and $f\_y$ represent the focal lengths along the image axes, expressed in pixels, and are equal for an ideal pinhole camera model with square pixels; $u\_0$ and $v\_0$ are the principal point offsets, and *s* is the axis skew coefficient.
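For illustration, the sketch below builds $\mathbf{K}\_{intr}$ and applies the forward pinhole projection; the intrinsic values are hypothetical placeholders, not taken from our calibration:

```python
import numpy as np

# Hypothetical intrinsic parameters (for illustration only).
fx, fy = 520.0, 520.0   # focal lengths in pixels
u0, v0 = 320.0, 240.0   # principal point offset
s = 0.0                 # axis skew, zero for most real cameras

K_intr = np.array([[fx,  s, u0],
                   [0., fy, v0],
                   [0., 0., 1.]])

# Forward pinhole projection of a point in the camera frame (z > 0):
p_cam = np.array([0.5, -0.2, 4.0])
u, v, _ = K_intr @ (p_cam / p_cam[2])   # divide by depth, then apply K
```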

To extract distinctive scene points from images, we use the SURF algorithm [21]. The SURF detector outputs a set of pixel coordinates in a given image frame at which the determinant of the Hessian matrix attains a local maximum. It is worth noting that SURF points are scale- and rotation-invariant. Exemplary extraction results are presented in Figure 1.

**Figure 1.** Frame from a downward-looking camera with extracted SURF features.
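As a minimal sketch of the extraction step, assuming an OpenCV build with the non-free `xfeatures2d` module enabled (the exact SURF settings of our system are not restated here, so the threshold is illustrative):

```python
import cv2

# Load a registered camera frame (hypothetical file name).
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Higher Hessian thresholds keep fewer, stronger responses of the
# determinant-of-Hessian detector.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(img, None)

# Pixel coordinates of the extracted scene points; each keypoint also stores
# scale and orientation, reflecting the invariance properties of SURF.
pixels = [kp.pt for kp in keypoints]
```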

To represent extracted scene points in the ENU coordinate frame, we use the concept of anchored modified-polar points, also referred to as inverse-distance points (IDP) [45]. A single IDP point is defined by a six-element vector:

$$^p\mathbf{m}\_k^i = \begin{bmatrix} x\_0 \\ y\_0 \\ z\_0 \\ \varepsilon \\ \alpha \\ \rho \end{bmatrix} \tag{4}$$

where $^p\mathbf{m}\_k^i$ is the state of the *i*-th landmark observed by the *p*-th particle at time *k*. The first three elements of the state vector ($x\_0$, $y\_0$, $z\_0$) are the ENU coordinates that encode the position of the particle from which the landmark was originally observed (the point $^p\mathbf{p}\_0$ in Figure 2). These coordinates are frequently referred to as the anchor point. Next, *ε* and *α* are respectively the elevation and azimuth angles at which the observation was made, while *ρ* = 1/*d* is the inverse of the distance between the camera and the scene point.

**Figure 2.** IDP landmark parametrization.

Figure 2 presents the idea of IDP landmark parametrization. The usage of anchored modified-polar points allows landmarks to be initialized in the map immediately. Still, the salient-feature extraction step of a monocular camera provides only two-dimensional measurements of three-dimensional object locations, so the distance *d* remains unknown. To resolve the inability to recover the true location of an environment feature, we set the initial inverse depth of every registered landmark to a preset value with a reasonably large uncertainty. As the camera points downwards, the starting depth is either assumed equal to the UAV's altitude or calculated from the positions of nearby, previously seen scene points. The standard deviation of this observation is selected so that infinite distance (inverse depth equal to 0) lies within the 3*σ* region. The undelayed initialization scheme complies with the adopted discrete-time Markov chain approach and satisfies the memoryless property, such that the currently processed state is sufficient to estimate the probability distribution of future states.
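To make the parametrization concrete, the sketch below recovers the Euclidean point encoded by an IDP vector and seeds the inverse depth as described above; the altitude value and function name are illustrative assumptions:

```python
import numpy as np

def idp_to_euclidean(m):
    """Recover the 3D ENU point encoded by an IDP vector
    m = [x0, y0, z0, elevation, azimuth, inverse distance]."""
    x0, y0, z0, eps, alpha, rho = m
    # Unit direction of the observation ray in ENU, from elevation/azimuth.
    direction = np.array([np.cos(eps) * np.cos(alpha),
                          np.cos(eps) * np.sin(alpha),
                          np.sin(eps)])
    return np.array([x0, y0, z0]) + direction / rho   # anchor + d * ray, d = 1/rho

# Undelayed initialization: seed rho from the UAV altitude (downward-looking
# camera), with a deviation wide enough to keep rho = 0 (infinite distance)
# inside the 3-sigma region.
altitude = 30.0               # hypothetical altitude in metres
rho0 = 1.0 / altitude
sigma_rho = rho0 / 3.0        # rho0 - 3*sigma_rho = 0, so infinity is included
```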

The first step in initializing a newly observed landmark is to transform its coordinates, expressed in pixels of the image plane, using the pinhole camera model. The inverse camera projection is performed in accordance with the equation below:

$$
\begin{bmatrix} x\_{cam} \\ y\_{cam} \\ 1 \end{bmatrix} = \mathbf{K}\_{intr}^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \tag{5}
$$

where *u* and *v* are pixel coordinates in a registered image frame, while $x\_{cam}$ and $y\_{cam}$ are coordinates in the standard reference frame of the camera. Next, by transforming the resultant vector into the global reference frame (ENU), the optical ray pointing from the camera center $^p\mathbf{p}\_0$ to the extracted scene point is obtained:

$$
\begin{bmatrix} x\_{enu} \\ y\_{enu} \\ z\_{enu} \end{bmatrix} = \ ^{enu}\mathbf{R}\_{cam} \begin{bmatrix} x\_{cam} \\ y\_{cam} \\ 1 \end{bmatrix} \tag{6}
$$

where $^{enu}\mathbf{R}\_{cam}$ encodes the rotation from the camera reference frame to ENU coordinates. The IDP parametrization removes the need for homogeneous 3D coordinates, as the obtained vector is subjected only to rotation.

The information conventionally contained in the translation component, when 3D Euclidean points are used as the landmark representation, is instead encoded in the anchor point $^p\mathbf{p}\_0$.

Next, the vector's ENU coordinates are expressed using a modified-polar point convention:

$$
\begin{bmatrix} \varepsilon \\ \alpha \\ \rho \end{bmatrix} = \begin{bmatrix} \operatorname{atan2}\left(z\_{enu}, \sqrt{x\_{enu}^2 + y\_{enu}^2}\right) \\ \operatorname{atan2}\left(y\_{enu}, x\_{enu}\right) \\ \frac{1}{d} \end{bmatrix} \tag{7}
$$

where *ε* and *α* are respectively the elevation and azimuth angles and *ρ* is the inverse depth. Appending the anchor point yields the complete IDP landmark parametrization:

$$^p\mathbf{m}\_k^i = \begin{bmatrix} ^p\mathbf{p}\_0 \\ \varepsilon \\ \alpha \\ \rho \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} x\_0 \\ y\_0 \\ z\_0 \end{bmatrix} \\ \operatorname{atan2}\left(z\_{enu}, \sqrt{x\_{enu}^2 + y\_{enu}^2}\right) \\ \operatorname{atan2}\left(y\_{enu}, x\_{enu}\right) \\ \frac{1}{d} \end{bmatrix} \tag{8}$$
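A compact sketch of the whole initialization chain, Equations (5) through (8); the function name and arguments are illustrative:

```python
import numpy as np

def initialize_idp_landmark(u, v, K_intr, R_enu_cam, p0, d0):
    """Undelayed initialization of a pixel (u, v) as an anchored IDP vector,
    given camera intrinsics, the camera-to-ENU rotation, the anchor point p0
    (camera centre in ENU) and an assumed initial depth d0."""
    # Eq. (5): inverse pinhole projection into the camera frame.
    ray_cam = np.linalg.inv(K_intr) @ np.array([u, v, 1.0])
    # Eq. (6): rotate the optical ray into the ENU frame (no translation needed).
    x, y, z = R_enu_cam @ ray_cam
    # Eq. (7): modified-polar coordinates of the ray.
    eps = np.arctan2(z, np.hypot(x, y))    # elevation
    alpha = np.arctan2(y, x)               # azimuth
    rho = 1.0 / d0                         # inverse of the assumed depth
    # Eq. (8): anchor point, angles and inverse depth stacked together.
    return np.concatenate([p0, [eps, alpha, rho]])
```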

As landmarks are stored in separate EKFs, the initialization procedure has to include the calculation of the landmark covariance matrices as well. The covariance matrix $^p\mathbf{P}\_k^i$ that describes the uncertainty of the transformation of a 2D point extracted from an image frame to its IDP representation is given by the following formula:

$$^p\mathbf{P}\_k^i = \left( ^p\mathbf{H}\_k^i \right)^{-1} \mathbf{R}\_k^p \left[ \left( ^p\mathbf{H}\_k^i \right)^{-1} \right]^T + \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & \sigma\_\rho^2 \end{bmatrix} \tag{9}$$

where $\mathbf{R}\_k^p$ is a two-by-two sensor-noise matrix describing the accuracy of scene-point localization in the image plane, and $\left(^p\mathbf{H}\_k^i\right)^{-1}$ follows the ordinary EKF notation and denotes the Jacobian matrix of the inverse observation function that transforms pixel coordinates to the ENU coordinate system. $\left(^p\mathbf{H}\_k^i\right)^{-1}$ is given by the following formula:

$$\left(^{p}\mathbf{H}\_{k}^{i}\right)^{-1} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \ ^{p}\mathbf{J}\_{k}^{i} \ ^{enu}\mathbf{R}\_{cam} \mathbf{K}\_{intr}^{-1} \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \tag{10}$$

where $^p\mathbf{J}\_k^i$ is the Jacobian matrix of the function that transforms the vector pointing from $^p\mathbf{p}\_0$ to the *i*-th landmark from 3D ENU Cartesian coordinates into the modified-spherical representation, and is equal to:

$$^{p}\mathbf{J}\_{k}^{i} = \begin{bmatrix} \frac{-x\_{enu} z\_{enu}}{\sqrt{x\_{enu}^2 + y\_{enu}^2}\left(x\_{enu}^2 + y\_{enu}^2 + z\_{enu}^2\right)} & \frac{-y\_{enu} z\_{enu}}{\sqrt{x\_{enu}^2 + y\_{enu}^2}\left(x\_{enu}^2 + y\_{enu}^2 + z\_{enu}^2\right)} & \frac{\sqrt{x\_{enu}^2 + y\_{enu}^2}}{x\_{enu}^2 + y\_{enu}^2 + z\_{enu}^2} \\ \frac{-y\_{enu}}{x\_{enu}^2 + y\_{enu}^2} & \frac{x\_{enu}}{x\_{enu}^2 + y\_{enu}^2} & 0 \\ 0 & 0 & 0 \end{bmatrix} \tag{11}$$
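The covariance initialization of Equations (9) through (11) can be sketched as follows, with illustrative names; `R_pix` stands for the two-by-two sensor-noise matrix $\mathbf{R}\_k^p$:

```python
import numpy as np

def idp_init_covariance(ray_enu, K_intr, R_enu_cam, R_pix, sigma_rho):
    """Covariance of a freshly initialized IDP landmark, given the ENU optical
    ray, intrinsics, camera-to-ENU rotation, 2x2 pixel noise R_pix and the
    inverse-depth deviation sigma_rho."""
    x, y, z = ray_enu
    r2 = x**2 + y**2
    q = r2 + z**2
    # Eq. (11): Jacobian of the Cartesian-to-modified-polar map; the last row
    # is zero because the initial inverse depth does not depend on the ray.
    J = np.array([[-x*z / (np.sqrt(r2)*q), -y*z / (np.sqrt(r2)*q), np.sqrt(r2)/q],
                  [-y / r2,                 x / r2,                0.0],
                  [0.0,                     0.0,                   0.0]])
    # Eq. (10): Jacobian of the inverse observation (pixels -> 6D IDP vector).
    lift = np.vstack([np.zeros((3, 3)), np.eye(3)])   # rows 4-6 hold (eps, alpha, rho)
    drop = np.array([[1., 0.], [0., 1.], [0., 0.]])   # select the (u, v) inputs
    H_inv = lift @ J @ R_enu_cam @ np.linalg.inv(K_intr) @ drop   # 6x2
    # Eq. (9): propagate pixel noise and add the inverse-depth prior variance.
    P = H_inv @ R_pix @ H_inv.T
    P[5, 5] += sigma_rho**2
    return P
```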

During camera movement, in subsequent time steps, the sensor model retrieves the bearings of extracted features from multiple consecutive views. The SLAM algorithm gradually estimates the true 3D positions of landmarks using the standard EKF update equations given below:

$$^p\mathbf{y}\_k^i = \mathbf{z}\_k^{[i]} - h\left(^p\mathbf{m}\_{k-1}^i\right) \tag{12}$$

$$^p\mathbf{S}\_k^i = \ ^p\mathbf{H}\_k^i \ ^p\mathbf{P}\_{k-1}^i \left(^p\mathbf{H}\_k^i\right)^T + \mathbf{R}\_k^p \tag{13}$$

$$^p\mathbf{K}\_k^i = \ ^p\mathbf{P}\_{k-1}^i \left(^p\mathbf{H}\_k^i\right)^T \left(^p\mathbf{S}\_k^i\right)^{-1} \tag{14}$$

$$^p\mathbf{m}\_k^i = \ ^p\mathbf{m}\_{k-1}^i + \ ^p\mathbf{K}\_k^i \ ^p\mathbf{y}\_k^i \tag{15}$$

$$^p\mathbf{P}\_k^i = \left(\mathbf{I}\_{6\times 6} - \ ^p\mathbf{K}\_k^i \ ^p\mathbf{H}\_k^i\right) \ ^p\mathbf{P}\_{k-1}^i \tag{16}$$

where $^p\mathbf{y}\_k^i$ is the measurement innovation (of the *i*-th landmark seen by the *p*-th particle), $\mathbf{z}\_k^{[i]}$ is the measurement itself, and *h* is the nonlinear vector function that projects a predicted scene-point location from 3D ENU coordinates to the image plane; it is the inverse of the transformation described in Equations (5) through (11). Furthermore, $^p\mathbf{S}\_k^i$, $^p\mathbf{H}\_k^i$ and $^p\mathbf{K}\_k^i$ are respectively the residual covariance, the Jacobian of the vector function *h*, and the Kalman gain, calculated for a given particle–landmark pair during time step *k*.
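A minimal sketch of this per-landmark update, assuming the caller supplies the projection function *h* and its 2×6 Jacobian evaluated for the given particle's camera pose (names are illustrative):

```python
import numpy as np

def ekf_landmark_update(m_prev, P_prev, z, h, H, R_pix):
    """One EKF update of a 6-state IDP landmark, following Eqs. (12)-(16)."""
    y = z - h(m_prev)                         # Eq. (12): innovation
    S = H @ P_prev @ H.T + R_pix              # Eq. (13): residual covariance
    K = P_prev @ H.T @ np.linalg.inv(S)       # Eq. (14): Kalman gain
    m = m_prev + K @ y                        # Eq. (15): state update
    P = (np.eye(6) - K @ H) @ P_prev          # Eq. (16): covariance update
    return m, P, y, S
```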

#### *3.3. Sequential Importance Resampling (SIR) Particle Filter*

A particle filter is a mathematical tool capable of accurately estimating non-linear, non-Gaussian, and multimodal distributions. Among the different PF variants, the sequential importance resampling (SIR) approach is best suited for SLAM applications, and most of the previously mentioned works use it to address the simultaneous localization and mapping problem. Our SLAM algorithm is built upon a SIR PF as well. At every filtering step, a weighted set of samples is drawn, in accordance with a predefined proposal distribution, from the particle set of the previous time step. To address the issue of particle filter degeneracy, a resampling procedure is run periodically to exclude the samples with the lowest weights, i.e., the less probable hypotheses.
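As an illustration of the resampling step, the sketch below implements systematic resampling, a common low-variance scheme; the exact resampling variant is an assumption here, not something specified in the text:

```python
import numpy as np

def systematic_resample(weights, rng=np.random.default_rng()):
    """Return indices of the particles to keep, given normalized weights."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n   # one random offset, n strata
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                            # guard against rounding error
    return np.searchsorted(cumulative, positions)
```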

To achieve a robust and efficient filtering procedure, the proposal distribution of samples has to match the desired distribution as closely as possible. Therefore, particles representing camera poses are sampled at a frequency equal to the IMU data rate. This process is performed in accordance with the motion model described by Equation (2), and its PDF is assumed to be of the following form:

$$
\psi(\mathbf{x}\_k|\mathbf{x}\_{k-1}, \mathbf{u}\_k) \tag{17}
$$

Performing SLAM, rather than simple navigation, requires the inclusion of the set of extracted landmarks in the filtering process. The straightforward approach of recovering the momentary camera pose and the map of its surroundings with a particle filter is given by the joint posterior:

$$p\left(\mathbf{x}\_k, \mathbf{m}\_k^{[1:M]}\Big|\mathbf{x}\_{k-1}, \mathbf{u}\_k, \mathbf{z}\_k\right)\tag{18}$$

where $\mathbf{m}\_k^{[1:M]}$ is the set of all *M* landmarks observed up to time *k*. The number of particles required to estimate the PDF accurately grows exponentially, as every newly added landmark increases the dimensionality of the state space that has to be sampled. Therefore, for SLAM systems designed to navigate robustly and effectively through large areas, a change in this approach is necessary. To counteract the rapid growth of the number of required particles, PF SLAM systems commonly exploit the fact that the estimates of observed scene points can be treated as conditionally independent if the robot trajectory is assumed to be known. The application of this relationship results in a special case of the Rao–Blackwellization (RB) of a particle filter and is based on marginalizing landmarks out of the state vector [20]. The standard RB PF is the mathematical basis of the adopted SLAM framework. The consequence of the state-vector size reduction is a decrease in the number of samples needed to perform the SLAM routine accurately. The resulting joint posterior is:

$$p\left(\mathbf{x}\_k, \mathbf{m}\_k^{[1:M]}\Big|\mathbf{x}\_{k-1}, \mathbf{u}\_k, \mathbf{z}\_k\right) = p(\mathbf{x}\_k|\mathbf{x}\_{k-1}, \mathbf{u}\_k)\,p\left(\mathbf{m}\_k^{[1:M]}\Big|\mathbf{x}\_k, \mathbf{z}\_k\right) \tag{19}$$

and can be further factored out as:

$$p\left(\mathbf{x}\_k, \mathbf{m}\_k^{[1:M]}\Big|\mathbf{x}\_{k-1}, \mathbf{u}\_k, \mathbf{z}\_k\right) = p(\mathbf{x}\_k|\mathbf{x}\_{k-1}, \mathbf{u}\_k)\prod\_{i=1}^{M} p\left(\mathbf{m}\_k^i\Big|\mathbf{x}\_k, \mathbf{z}\_k\right)\tag{20}$$

The implementation of this conceptual solution to Rao–Blackwellized SLAM divides the estimation task among different filters. The main particle filter models the camera trajectory with a number of weighted samples, while landmark positions are estimated with EKFs, whose estimation accuracy determines the particle weights. Every particle has to maintain a separate EKF for every observed scene point. This implies a computational complexity of *O*(*MN*) for an *N*-particle distribution describing the camera pose and an *M*-landmark map.
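A structural sketch of this division of labor is given below; the class layout and the Gaussian weighting rule are illustrative assumptions, not a verbatim description of our implementation:

```python
import numpy as np

# Each particle samples one camera-pose hypothesis and keeps an independent
# 6-state EKF (mean, covariance) per observed landmark.
class Particle:
    def __init__(self, pose):
        self.pose = pose          # camera pose hypothesis x_k
        self.landmarks = {}       # landmark id -> (mean m, covariance P)
        self.weight = 1.0

def reweight(particle, innovations):
    """Multiply the particle weight by the Gaussian measurement likelihood of
    each observed landmark, using the EKF innovation y and residual
    covariance S from Eqs. (12)-(13)."""
    for y, S in innovations:
        norm = 1.0 / np.sqrt((2 * np.pi) ** len(y) * np.linalg.det(S))
        particle.weight *= norm * np.exp(-0.5 * y @ np.linalg.solve(S, y))
```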
