#### **2. Related Work**

The earliest idea of a SLAM framework employing a Rao–Blackwellized particle filter was briefly discussed in [20], where the authors advocated the use of Bayesian inference to achieve autonomous localization and mapping capabilities. Undoubtedly, the best-known Rao–Blackwellized particle filter SLAM solution was presented in [6], where lidar was used to build a grid map. Many later systems have been built upon the FastSLAM framework (both 1.0 and 2.0 [7]), especially those using rangefinders as their main sensors [21–23].

Rapid progress in computer vision and in computational capacity led to the proliferation of cameras in robotics. Among the first systems to employ a particle filter and a camera for simultaneous localization and mapping was the one described in [24]. That algorithm used edges, found with the Sobel mask, as landmarks, but did not exploit the fact that different observations can be treated as probabilistically independent once the camera position and orientation are known. As a result, 10,000 particles had to be used to estimate the camera pose together with only 15 landmarks. In [25], the authors implemented an algorithm that sequentially approximated the full 6DoF posterior of a camera together with up to eight tracked 3DoF scene points. This method was successfully validated using 500 particles; however, it required a small set of previously known landmarks to work properly. Another system, developed by Sim et al. [26], is an early example of a Rao–Blackwellized particle filter framework built around a camera as the main sensor. Using the SIFT algorithm [27], the authors' indoor mobile robot was able to successfully track more than 11,000 landmarks. Later, Eade and Drummond [15] constructed an algorithm capable of camera tracking with as few as 50 particles and implemented the ingenious idea of inverse depth [28] as the third element of a 3DoF landmark-state vector. In [29], the authors extracted landmarks using speeded-up robust features (SURF) [30] and applied a global optimization algorithm to achieve optimal matching of scene points.

While the earliest formulations of Rao–Blackwellized particle filters in a FastSLAM-like framework assumed that map elements are estimated using EKFs, other monocular SLAM approaches implemented alternative nonlinear Kalman filtering strategies, namely the UKF. In [31], the authors exploited the spatial structure of the environment and developed an algorithm that searched for and extracted locally planar objects as landmarks, whilst Lee [32] introduced a template prediction mechanism to compensate for camera motion. Both systems employed UKFs for landmark storage to avoid the Jacobian approximations required by EKFs.

The profound analysis presented in [9] marked a milestone in the SLAM domain, as more researchers have shifted towards keyframe-based simultaneous localization and mapping approaches over the last decade. Still, numerous efficient PF SLAM methods have been implemented since. One of them was proposed in [33] as a pose-tracking tool for augmented reality. The algorithm discarded outliers indirectly: not during the data-association-and-gating phase but after the particle-weighting procedure, since incorrect matches lower sample weights significantly, thus minimizing the chances of a given particle being resampled. The authors used both points and lines, extracted with the Harris corner detector [34] and the Hough line transform, respectively. A different particle-filter-based solution to the SLAM problem was outlined in [35] for an indoor aerial vehicle. Since the robot was designed to navigate only inside man-made structures, the system exploited the abundance of straight lines in the camera images and employed human-like procedures to predict landmark depths. The ranging strategy assumed that the monocular camera altitude was known and used this information to process the relative poses of observed geometric structures in order to synthesize a simultaneous localization-and-mapping algorithm similar to the FastSLAM approach.
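As a loose illustration of this weight-based outlier handling, the sketch below weights a particle by the joint likelihood of its feature matches, so a single outlier match collapses the weight and the particle is unlikely to survive resampling. It is a minimal Python sketch with hypothetical names, not the implementation from [33]:

```python
import numpy as np

def weight_particle(predicted_obs, matched_obs, meas_cov):
    """Weight one particle by the likelihood of its feature matches;
    an outlier match yields a tiny factor and collapses the weight."""
    inv_cov = np.linalg.inv(meas_cov)
    norm = 1.0 / np.sqrt(np.linalg.det(2.0 * np.pi * meas_cov))
    w = 1.0
    for z_hat, z in zip(predicted_obs, matched_obs):
        innov = z - z_hat                         # innovation for this match
        w *= norm * np.exp(-0.5 * innov @ inv_cov @ innov)
    return w

def resample(particles, weights, rng=np.random.default_rng()):
    """Particles weighted down by bad matches are rarely drawn again."""
    w = np.asarray(weights) / np.sum(weights)
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx]
```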

As mentioned before, the monocular-camera observation model suffers greatly from a lack of depth information. In contrast to monocular-camera algorithms, RGB-D-camera-based SLAM approaches are able to initialize landmark depths directly from a single sensor reading. One paper [36] presents a remarkable stereo-camera particle-filter SLAM solution, where the authors proposed a smart procedure for outlier identification through landmark-position correlation analysis. Moreover, landmarks are efficiently detected and matched using the SURF algorithm. Another way to tackle the unknown-depth issue is to place a pattern of known dimensions inside the camera's field of view. In [37], the authors proposed inserting a chessboard into the first few camera frames to allow accurate depth estimation as well as a reduction of camera-pose uncertainty in a monocular-camera-based SLAM system. To calculate the depth of subsequently observed landmarks, the described algorithm delayed their initialization until a triangulation procedure could be carried out.
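Such delayed initialization rests on standard two-view triangulation. A minimal linear (DLT) sketch is given below, assuming known 3×4 projection matrices for the two observing poses; the function and variable names are hypothetical:

```python
import numpy as np

def triangulate_dlt(P1, P2, uv1, uv2):
    """Linear (DLT) two-view triangulation: uv1 and uv2 are pixel
    observations of the same landmark under projection matrices P1, P2."""
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)   # solution is the right singular vector
    X = vt[-1]                    # of the smallest singular value
    return X[:3] / X[3]           # de-homogenize to a 3D point
```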

Since SLAM is strictly a perception problem, one would intuitively seek its refinement in a modification of the observation model. However, Zhou and Maskell proposed an improvement to the FastSLAM framework based on a revision of the motion model. In [38], the estimation of the system dynamics was partitioned into two submodels: the camera location was calculated with a particle filter, and its velocity with a Kalman filter. This idea reduced the particle filter's dimensionality and achieved better accuracy than an analogous solution constructed upon the classic FastSLAM framework.
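A loose sketch of this partitioning idea follows; it is not the authors' implementation, and the per-particle Kalman update and noise parameters below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

class Particle:
    """Position is sampled by the particle filter; velocity is kept as a
    per-particle Gaussian and propagated by a small Kalman filter."""
    def __init__(self):
        self.pos = np.zeros(3)            # PF-sampled camera location
        self.vel_mean = np.zeros(3)       # KF estimate of velocity
        self.vel_cov = np.eye(3) * 0.1

def predict(p, accel, dt, q_vel=0.01, q_pos=0.005):
    # Kalman-filter time update of the velocity sub-state
    p.vel_mean = p.vel_mean + accel * dt
    p.vel_cov = p.vel_cov + np.eye(3) * q_vel
    # Sample the position using the velocity estimate (PF proposal);
    # only the 3D position is sampled, keeping the PF low-dimensional
    p.pos = p.pos + p.vel_mean * dt + rng.normal(0.0, np.sqrt(q_pos), 3)
```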

Further examples of particle-filter SLAM also include systems that are not purely visual, where data from rangefinders are fused with images. Chen et al. [39] implemented a system in which an urban search-and-rescue robot navigates using 2D lidar in a feature-based 3D map constructed with a monocular camera. This makes it possible to obtain the real scale of the surroundings as well as to maintain full 6DoF motion and mapping capabilities. In [40], the authors propose a system in which one robot performs camera SLAM while the others reuse the resulting map for simultaneous localization and map-scale estimation.

Among additional instances of PF-based systems in robotics, recent works not directly related to the SLAM problem should be listed as well. The manuscript by Acevedo et al. [41] characterizes an algorithm that employs a particle filter to enable a group of networked robots to search for a moving target. In [42], a PF is used as a framework for 6DoF visual pose-tracking, where Rao–Blackwellization is introduced by decoupling the translation and orientation data. The paper by Yuan et al. [43] describes a PF-based redetection mechanism for object tracking that provides accurate localization in difficult conditions. In [44], the authors describe a self-localization technique employing a PF based on particle swarm optimization, which requires fewer particles to function correctly than comparable benchmark approaches.

#### **3. Materials and Methods**

Although the method we propose in this paper can easily be adjusted to any monocular-camera configuration, our SLAM framework was designed and validated under the assumption that it will serve as a secondary navigational system for a surveillance UAV with its monocular camera pointed downward. Based on inertial-measurement-unit (IMU) measurements and subsequent camera frames, we were able to synthesize the registered data into a correct trajectory together with a sparse map of the observed area.

#### *3.1. Motion Model*

To minimize the number of reference frames needed, our system directly estimates the pose (position and orientation) of the camera sensor rather than the pose of the UAV itself. The state vector **x***k* at time step *k* consists of nine variables:

$$\mathbf{x}\_k = \begin{bmatrix} x \\ y \\ z \\ v\_x \\ v\_y \\ v\_z \\ \varphi \\ \theta \\ \psi \end{bmatrix} \tag{1}$$

where *x*, *y*, *z* represent the localization in Cartesian coordinates, *vx*, *vy*, *vz* are the orthogonal components of the velocity vector, and *φ*, *θ* and *ψ* are the roll, pitch and yaw orientation angles, respectively. We chose the east-north-up (ENU) coordinate system as the global reference frame, where *x* points east, *y* north and *z* up. The camera orientation relative to the ENU frame is obtained from the onboard IMU.
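For illustration, the IMU-supplied angles can be converted into a body-to-ENU rotation matrix. The sketch below assumes a Z-Y-X (yaw, pitch, roll) composition, which is our assumption here rather than a stated convention:

```python
import numpy as np

def rotation_enu_from_rpy(roll, pitch, yaw):
    """Body-to-ENU rotation from roll/pitch/yaw, composed as Rz @ Ry @ Rx
    (Z-Y-X order); angles are in radians."""
    cr, sr = np.cos(roll),  np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw),   np.sin(yaw)
    Rz = np.array([[cy, -sy, 0], [sy,  cy, 0], [0, 0, 1]])   # yaw
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])    # pitch
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])    # roll
    return Rz @ Ry @ Rx
```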

While the camera itself has six degrees of freedom (6DoF), the bearing-only observation model makes monocular SLAM a 7DoF problem, in which the map representation can be determined only up to scale. To mitigate scale ambiguity and drift, an inherent problem of single-camera SLAM frameworks, we chose the onboard IMU as an additional sensor. It provides a stream of metric measurements that serve as control signals for trajectory prediction. The use of an inertial measurement unit makes a constant-velocity (CV) motion model the most appropriate choice. The discretized motion equation describing the camera movement is given below:

$$\mathbf{x}\_{k} = \mathbf{f}(\mathbf{x}\_{k-1}, \mathbf{u}\_{k}) + \mathbf{w}\_{k} \tag{2}$$

where **x***k* is the predicted state vector containing estimates of the platform's kinematic parameters, **x***k*−1 is the state vector estimated during the previous time step *k* − 1, **f** is the nonlinear state-transition vector function, **u***k* represents the IMU readings, and **w***k* is additive Gaussian process noise described by a zero-mean multivariate normal distribution *N*(0, **Q***k*), where **Q***k* denotes the process covariance matrix.
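A minimal sketch of this prediction step follows. How **u***k* enters **f** (ENU-frame acceleration plus IMU-derived orientation angles) is assumed here for illustration and is not prescribed by Equation (2):

```python
import numpy as np

rng = np.random.default_rng()

def predict_state(x_prev, u_imu, dt, Q):
    """Constant-velocity prediction of the nine-dimensional state
    [x, y, z, vx, vy, vz, roll, pitch, yaw] per Equation (2).
    u_imu = (accel, angles): assumed ENU-frame acceleration (3,)
    and IMU orientation angles (3,); Q is the 9x9 process covariance."""
    accel, angles = u_imu
    x = np.copy(x_prev)
    x[0:3] += x_prev[3:6] * dt      # position advances with velocity
    x[3:6] += accel * dt            # velocity updated from IMU control
    x[6:9] = angles                 # orientation taken from the IMU
    # additive zero-mean Gaussian process noise w_k ~ N(0, Q)
    return x + rng.multivariate_normal(np.zeros(9), Q)
```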
