Review

A Review on Visual-SLAM: Advancements from Geometric Modelling to Learning-Based Semantic Scene Understanding Using Multi-Modal Sensor Fusion

School of Computer Science, The University of Sydney, Camperdown, NSW 2006, Australia
Sensors 2022, 22(19), 7265; https://doi.org/10.3390/s22197265
Submission received: 28 August 2022 / Revised: 12 September 2022 / Accepted: 19 September 2022 / Published: 25 September 2022

Abstract

Simultaneous Localisation and Mapping (SLAM) is one of the fundamental problems in autonomous mobile robots, where a robot needs to reconstruct a previously unseen environment while simultaneously localising itself with respect to the map. In particular, Visual-SLAM uses the various sensors mounted on the mobile robot to collect and sense a representation of the map. Traditionally, geometric model-based techniques were used to tackle the SLAM problem, and these tend to be error-prone in challenging environments. Recent advancements in computer vision, such as deep learning techniques, have provided a data-driven approach to tackle the Visual-SLAM problem. This review summarises recent advancements in the Visual-SLAM domain using various learning-based methods. We begin by providing a concise overview of the geometric model-based approaches, followed by technical reviews on the current paradigms in SLAM. Then, we present the various learning-based approaches to collecting sensory inputs from mobile robots and performing scene understanding. The current paradigms in deep-learning-based semantic understanding are discussed and placed in the context of Visual-SLAM. Finally, we discuss challenges and further opportunities in the direction of learning-based approaches in Visual-SLAM.

1. Introduction

Autonomous navigation in mobile robots has become an active research field in recent years due to advancements in material science for robot construction, compact battery sizes for longer-duration remote operations, and increases in computational power for running algorithmic and artificial intelligence methods. Mobile robots are capable of navigating within an environment to perform their respective objectives, and most autonomous robots need to move around while respecting the constraints of that environment. Robots “see and understand” the world by collecting information from their attached sensors and making sense of the readings. Specifically, for robots to interact with the real world, they need some capability of understanding the scene geometrically and interpreting it semantically. Accurate localisation is essential for the robot to compute suitable actions, and a good knowledge or perception of the environment allows the robot to react to its surroundings.
Mobile robots typically receive sensor information from their attached sensors [1], for example, in the form of 2D projections of image frames or 3D spatial points from high-frequency LiDAR scans [2]. However, this perceived information is often insufficient for the robot to navigate, as it lacks a geometric understanding and reconstruction of the scene. Geometric modelling is especially essential in complex tasks where the robot needs to localise itself with respect to the modelled map of the scene to navigate and accomplish its objectives. For example, mobile robot navigation often requires the robot to maintain a map representation for planning in tasks such as motion planning [3], navigating mobile robots [4], moving robotic arms [5], or even driving autonomous cars [6]. It is infeasible for the robot to perform safe autonomous operations in a dynamic environment if it lacks the ability to perceive and make sense of potential obstacles. This is especially important for the robot to operate in a novel region without any prior information about the environment, for example, in planetary exploration or search and rescue operations.
Difficulty in the geometric reconstruction of the environment while navigating within it often arises from issues such as sensor modalities, misalignment of map representations, or observation noise [7,8,9]. Simultaneous Localisation and Mapping (SLAM) addresses this as an online incremental process where the mobile robot needs to refine the reconstructed map and its current location iteratively by observing more of the unknown environment with various sensors [10]. In Visual-SLAM, the most straightforward representation of the environment often consists of a collection of sparse 3D points processed continuously during navigation. A robotic system collects visual inputs for Visual-SLAM while operating in some unknown environment [11]; it constructs a map of the surroundings using various sensors while simultaneously estimating its position with respect to the environment. The constructed map can be used for mapping a novel environment or for the robot to plan its mission with autonomy. Such a system can maintain stability, plan its own movements, and react to dynamic changes in the surroundings without human intervention [12].
SLAM for mobile robots and semantic scene understanding span the domains of robotics, computer vision, and sensor technologies. A wide range of sensors can be used for SLAM. For example, most autonomous vehicles use Light Detection and Ranging (LiDAR) sensors [13] or stereo cameras [14] to perceive the surrounding environment during navigation. LiDAR can often provide a more accurate environment representation by providing a 3D point cloud with ranging measurements. Traditional LiDAR sensors are rarely used in consumer-grade mobile robots due to their high cost; however, advancements in manufacturing have enabled LiDAR sensors to become more common in mobile robot navigation [15,16]. In contrast, stereo cameras are more ubiquitous as their manufacturing cost is much lower than LiDAR. On the other hand, the hybridisation of multiple sensors has been shown to significantly enhance localisation and mapping performance in most SLAM approaches [17]. Hybridising visual sensors with classical proprioceptive sensors such as IMUs or odometers can often reduce the localisation drift caused by the cumulative error of these relative positioning approaches. Rather than there being some superior type of sensor, different sensing methods have their relative strengths and weaknesses [18,19]. For example, laser scanners in LiDAR are efficient for obstacle detection but are highly sensitive to weather conditions such as rain. In contrast, RGB cameras can extract semantic meaning from the captured images but are sensitive to lighting conditions. Therefore, most sensors are complementary, and it is an open research question how to match and balance each type of sensor with its respective strengths and weaknesses.
Semantic scene understanding is neglected in traditional approaches, which only focus on the geometric reconstruction of the environment. Rather than treating the collected points as unrelated measurements, semantic understanding assigns higher-level meanings to the collected data [20]. Real-world environments often contain many structures and objects that carry high-level semantic information, which can act as landmarks in SLAM. Assigning semantic meaning can be helpful both for inferring missing information and for providing complementary information when reconstructing the scene [21,22]. Moreover, reconstructing the geometric representation of the scene together with its semantic meaning can help the mobile robot make higher-level decisions on selecting landmarks that are suitable for the environment and inform the robot planner when deciding its mission route.
This review article provides an overview of the current state-of-the-art Visual-SLAM paradigms and models with a focus on sensor fusion. This paper is organised as follows. We begin by providing a concise summary of the theory behind the SLAM process and the formulation of the geometric modelling of the surrounding environment. Then, we present an overview of the evolving SLAM paradigms throughout recent years, including both approaches in pure geometric reconstruction and semantic scene understanding using deep-learning models. Finally, we discuss the current state-of-the-art Visual-SLAM models to understand our progress and future directions in SLAM.

2. Simultaneous Localisation and Mapping

Simultaneous Localisation and Mapping (SLAM) is a problem where a robot needs to operate in an unknown environment to construct a map while estimating its uncertain location [23]. SLAM is a fundamental problem in numerous robotics applications that require the robot to navigate autonomously within some environment and interact with the real world. In the following, we introduce the problem setup in SLAM, followed by a formalisation of the fundamental theory behind SLAM algorithms.

2.1. Problem Setup

The problem’s difficulty comes from a recursive dependency: constructing a map often depends on the robot observing the environment from some known location, while state estimation also often requires the robot to infer its location by relying on some known landmarks. A SLAM algorithm estimates the sensor motion and simultaneously reconstructs the geometrical structure of the visited area. Chatila and Laumond [24] first formalised the problem setup in 1985 for mobile robot navigation. The problem lies in the robot’s need to model the environment and locate itself correctly despite the inaccuracies introduced by its sensors. The proposed methodology defines a general principle for dealing with uncertainties in the collected data and for a mobile robot to define its reference landmarks while exploring the environment.
The fundamental idea of SLAM lies in using landmark correlations, data association, and loop-closure to reduce the uncertainties about its previously visited area and poses [25]. Traditional techniques for sequential state estimation include the Kalman filter [26], which is the optimal state estimator for a linear system with Gaussian noise. Practical implementations of SLAM often use the extended Kalman filter (EKF) for state estimation [27], which is advantageous because the Gaussian assumption allows the EKF estimate to be propagated analytically. If the system has non-Gaussian noise, the Kalman filter is still the optimal linear filter but performs worse than other techniques. For nonlinear systems, methods such as the particle filter [28] can be a more flexible alternative as they do not rely on any local linearisation technique or crude functional approximation. However, the higher performance comes with a higher computational effort than Kalman filters: in particle filtering, we need to perform weighted sampling to estimate the distribution of the robot state, rather than obtaining the robot state distribution analytically from the mean and covariance matrix of a Gaussian distribution.
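To make the Kalman predict-update cycle concrete, the following is a minimal sketch of a linear Kalman filter for a hypothetical 1D constant-velocity robot observed through noisy position readings; the transition, observation, and noise matrices are illustrative assumptions rather than values from any cited system.

```python
import numpy as np

# Minimal linear Kalman filter for a 1D constant-velocity robot.
# All matrices and noise values are illustrative assumptions.
F = np.array([[1.0, 1.0],   # state transition: position += velocity * dt (dt = 1)
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])  # we only observe the position
Q = 0.01 * np.eye(2)        # process noise covariance
R = np.array([[0.5]])       # measurement noise covariance

def kf_step(x, P, z):
    """One predict-update cycle given a scalar position measurement z."""
    # Predict: propagate the mean and covariance through the linear model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the measurement.
    y = z - H @ x_pred                   # innovation
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    return x_pred + K @ y, (np.eye(2) - K @ H) @ P_pred

x, P = np.zeros((2, 1)), np.eye(2)       # initial state [position, velocity]
for z in [1.1, 2.0, 2.9, 4.2]:           # fake position readings
    x, P = kf_step(x, P, np.array([[z]]))
print(x.ravel(), np.trace(P))
```

An EKF follows the same cycle but linearises nonlinear motion and observation models around the current estimate at every step, while a particle filter replaces the Gaussian (x, P) pair with a weighted set of samples.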
There are multiple metrics for measuring the benefit of actions. For example, A-optimality measures the trace of the covariance matrix [29], which is equivalent to minimising the mean squared error between the data and the model parameters. D-optimality, on the other hand, minimises the determinant of the covariance matrix [30], which is equivalent to minimising the entropy of the SLAM system [31]. For example, we can utilise building structure lines as features for localisation and mapping; these encode global orientation information that constrains the robot’s heading over time. Such features help eliminate accumulated orientation errors and reduce position drift in SLAM algorithms. In SLAM, the concept of loop closure can also reduce drift errors by allowing the robot to reset its estimated state by revisiting a known portion of the map [30]. Active SLAM methods often exploit this property by guiding the robot to regions that allow it to close the loop [32], which can significantly reduce the localisation error [33]. Autonomous navigation in an indoor environment often requires multiple sensory inputs and actuating outputs. Wheeled ground mobile robots are often designed with a differential drive base that uses DC geared or stepper motors for their driving wheels. Mobile robots collect data from onboard sensors such as wheel encoders, inertial measurement units (IMUs), RGB cameras for visual inputs, or LiDAR as remote sensing to measure ranges.
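As a small illustration of how these criteria are evaluated, the snippet below computes the trace (A-optimality), the log-determinant (D-optimality), and the corresponding Gaussian entropy for an arbitrary, assumed pose covariance matrix.

```python
import numpy as np

# Illustrative comparison of A-optimality and D-optimality on a SLAM covariance.
# The covariance below is an arbitrary assumption used only for demonstration.
Sigma = np.array([[0.20, 0.02, 0.00],
                  [0.02, 0.10, 0.01],
                  [0.00, 0.01, 0.40]])

a_opt = np.trace(Sigma)               # A-optimality: sum of variances (mean squared error)
_, logdet = np.linalg.slogdet(Sigma)  # D-optimality: (log-)determinant of the covariance
n = Sigma.shape[0]
entropy = 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)  # entropy of the Gaussian

print(f"A-opt (trace): {a_opt:.3f}, D-opt (log det): {logdet:.3f}, entropy: {entropy:.3f}")
```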
Loop closing detects whether a given keyframe has been seen previously [34,35]. Loop closure can be formulated as an optimisation problem, such as a nonlinear least-squares problem that matches the current scans with previously visited areas. One reason that loop closing is hard in SLAM is that the internal estimates can, despite best efforts, be in gross error. Loop closing is essentially a data association problem, where a positive loop closure occurs when the robot recognises the local scene as one that it has previously visited. Traditional feature-based SLAM uses simple geometric primitives such as corners or lines as features. When a loop closure is detected, it acts as an opportunity to constrain the robot’s internal estimate of its current state with respect to the map.
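The data-association flavour of loop closure can be illustrated with a toy bag-of-words check: each keyframe is summarised by a histogram over a visual vocabulary, and a loop-closure candidate is flagged when the similarity to a stored keyframe exceeds a threshold. The vocabulary size, threshold, and histograms below are assumptions for illustration only.

```python
import numpy as np

def bow_similarity(h1, h2):
    """Cosine similarity between two bag-of-words histograms."""
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))

rng = np.random.default_rng(0)
vocab_size = 500
past_keyframes = [rng.random(vocab_size) for _ in range(50)]   # stored BoW vectors
current = past_keyframes[17] + 0.05 * rng.random(vocab_size)   # a revisited place

scores = [bow_similarity(current, kf) for kf in past_keyframes]
best = int(np.argmax(scores))
if scores[best] > 0.9:   # the threshold would be tuned per system
    print(f"loop-closure candidate: keyframe {best} (similarity {scores[best]:.3f})")
```

A real system would verify such a candidate geometrically before adding the loop-closure constraint, precisely because false positives corrupt the map.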

2.2. SLAM Formulation

SLAM is a multi-discipline problem that spans both the computer vision and robotics domains and is traditionally formulated as maximum a posteriori (MAP) estimation. In Visual-SLAM, we define $X = \{X_i^w\}_{i=1}^{N}$ as the trajectory of the robot over time, where $X_i^w$ denotes the pose of the robot parameterised in the set of rigid Euclidean transformations $\mathrm{SE}(3)$. Let $L = \{l_j\}_{j=1}^{M}$ denote the set of landmarks parameterised by their appropriate representation space, $Z = \{z_k\}_{k=1}^{K}$ be the set of observations of the detected landmarks, and $U = \{u_i\}_{i=1}^{N-1}$ be the set of odometry measurements between robot poses. The observations $Z$ of the landmarks are collected under some observation model $h_k(\cdot)$, given by
$$ z_k = h_k(X_{i_k}, l_{j_k}) + \epsilon_k \tag{1} $$
where $X_{i_k}$ and $l_{j_k}$ denote the actual robot state and landmark pose, and $\epsilon_k$ is a random measurement noise. The solution to the SLAM problem is the optimal MAP estimate
$$ X^*, L^* = \operatorname*{arg\,max}_{X, L}\; P(X, L \mid Z, U) \tag{2} $$
where $P(X, L \mid Z, U)$ is the joint probability of all latent estimate variables given all of our previous observations and measurements. For a classical SLAM problem without odometry measurements [36], we can rewrite (2) as
$$ X^*, L^* = \operatorname*{arg\,max}_{X, L}\; P(X, L \mid Z) \tag{3} $$
$$ \phantom{X^*, L^*} = \operatorname*{arg\,max}_{X, L}\; P(Z \mid X, L)\, P(X, L) \tag{4} $$
where $P(Z \mid X, L)$ is the likelihood of obtaining the measurement $Z$ given $X$ and $L$, and $P(X, L)$ is the prior knowledge on $X$ and $L$. Assuming that each observation $z_k$ is independent, we can then compute (4) as
$$ X^*, L^* = \operatorname*{arg\,max}_{X, L}\; \prod_{k=1}^{K} P(z_k \mid X_k, L_k)\, P(X, L). \tag{5} $$

2.3. Factor Graph and Loop-Closure

Factor graphs [37] represent an essential part of modern approaches to the probabilistic SLAM problem, enabling factorisation of and inference over arbitrary distribution functions. A factor graph $G(\mathcal{V}, \mathcal{F}; \mathcal{E})$ is a bipartite graph that determines the factorisation of a global function into a product of local functions. Specifically, the set of vertices $\mathcal{V}$ in the graph $G$ represents the latent variables that participate in the estimation process. The set of factors $\mathcal{F}$ represents the prior knowledge regarding variable nodes and the constraints between nodes, where the connections between nodes are represented by the set of edges $\mathcal{E}$.
We can represent a classical SLAM problem as a factor graph as depicted in Figure 1, where the joint probability distribution of the MAP estimation is factorised as a product over observation factors. Using the factor graph notation, we can rewrite the MAP formulation in (2) as
$$ X^*, L^* = \operatorname*{arg\,max}_{X, L}\; P(X, L \mid Z) \tag{6} $$
$$ \phantom{X^*, L^*} = \operatorname*{arg\,min}_{X, L}\; \sum_{k=1}^{K} \left\lVert h_k(X_{i_k}^w, l_{j_k}) \ominus z_k \right\rVert_{\Sigma_k}^{2} \tag{7} $$
where $h_k$ denotes the $k$th factor of observing a landmark $l_{j_k}$ from the camera pose $X_{i_k}^w$ under the sensor model, $z_k$ is the corresponding measurement, the notation $\lVert \cdot \rVert_{\Sigma}^{2}$ denotes the squared Mahalanobis norm with covariance matrix $\Sigma$, and $\ominus$ is the difference operator in the target measurement space.
The SLAM problem can be formulated as a Bayes net under the factor graph formulation, as factorisation of and inference over probability distributions and functions [36,38,39,40,41]. A factor graph is a bipartite graph that characterises how a global multi-variable function can be factorised into a product of local functions. Each blue and green node in Figure 1, also known as variables, represents the set of latent variables that need to be estimated, which in the case of SLAM are the states of the robot and the landmarks. The nodes in between the variables are known as factors, which represent the set of constraints and information between the variables. We can use a factor graph to factorise a joint probability distribution over some random variables by encoding the inherent conditional independence of some local variables into the joint probability distribution.
The joint probability distribution of all the latent estimate variables of the SLAM problem can be written as
$$ P(X, L \mid Z, U) \propto P(X_0^w) \prod_{k=1}^{K} P(z_k \mid X_{i_k}^w, l_{j_k}) \prod_{i=1}^{N} P(X_i^w \mid X_{i-1}^w, u_{i-1}) \tag{8} $$
where $P(X_0^w) \triangleq P_0$ is the prior belief on the robot's initial pose, $P(z_k \mid X_{i_k}^w, l_{j_k})$ represents the effect of landmark observation $z_k$ given the data association $(i_k, j_k)$, and $P(X_i^w \mid X_{i-1}^w, u_{i-1})$ represents the state update given the motion model. Assuming a zero-mean Gaussian observation noise for the observations $Z$ and odometry $U$, we can rewrite (8) as
$$ P(X, L \mid Z, U) \propto \underbrace{\prod_{k=1}^{K} \exp\!\left( -\tfrac{1}{2} \left\lVert h_k(X_{i_k}^w, l_{j_k}) \ominus z_k \right\rVert_{\Sigma_k}^{2} \right)}_{\text{effect of observations}} \; \underbrace{\prod_{i=1}^{N} \exp\!\left( -\tfrac{1}{2} \left\lVert f_o(X_{i-1}^w, u_{i-1}) \ominus X_i^w \right\rVert_{\Sigma_o}^{2} \right)}_{\text{effect of odometry}} \tag{9} $$
where $h_k$ is the sensor model, $f_o$ is the motion model, and $\Sigma_k$ and $\Sigma_o$ are the covariance matrices for the Gaussian noise in $Z$ and $U$, respectively.
We can further factorise this joint probability distribution to obtain the optimal MAP estimation by solving the equivalent least-squares form of
$$ X^*, L^* = \operatorname*{arg\,max}_{X, L}\; P(X, L \mid Z, U) \tag{10} $$
$$ \phantom{X^*, L^*} = \operatorname*{arg\,min}_{X, L}\; -\log P(X, L \mid Z, U) \tag{11} $$
$$ \phantom{X^*, L^*} = \operatorname*{arg\,min}_{X, L}\; \underbrace{\sum_{k=1}^{K} \left\lVert h_k(X_{i_k}^w, l_{j_k}) \ominus z_k \right\rVert_{\Sigma_k}^{2}}_{\text{effect of observations}} + \underbrace{\sum_{i=1}^{N} \left\lVert f_o(X_{i-1}^w, u_{i-1}) \ominus X_i^w \right\rVert_{\Sigma_o}^{2}}_{\text{effect of odometry}}, \tag{12} $$
which can be interpreted graphically as a factor graph as the one shown in Figure 1.
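As a toy instance of the least-squares objective in (12), the sketch below builds whitened odometry and landmark-observation residuals for a hypothetical 1D robot and solves them with a generic nonlinear least-squares routine; all measurements, noise scales, and the prior on the first pose are assumptions, and a practical system would use a dedicated factor-graph solver instead.

```python
import numpy as np
from scipy.optimize import least_squares

odom = [1.0, 1.1]                     # u_i: measured displacement between poses
obs = [(0, 3.1), (1, 2.0), (2, 0.9)]  # (pose index, measured range to the landmark)
sigma_o, sigma_z = 0.1, 0.2           # odometry / observation standard deviations

def residuals(theta):
    x, l = theta[:3], theta[3]        # poses x_0..x_2 and the landmark position
    r = [(x[0] - 0.0) / 1e-3]         # strong prior pinning x_0 at the origin
    for i, u in enumerate(odom):      # odometry factors: x_{i+1} - x_i ~ u_i
        r.append((x[i + 1] - x[i] - u) / sigma_o)
    for i, z in obs:                  # observation factors: l - x_i ~ z_k
        r.append((l - x[i] - z) / sigma_z)
    return np.array(r)                # dividing by sigma implements the (diagonal) Mahalanobis norm

sol = least_squares(residuals, np.zeros(4))
print("poses:", sol.x[:3], "landmark:", sol.x[3])
```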
The depiction in Figure 2 indicates an instance of classical bundle adjustment (BA) [42,43]. In BA, the factor graph's variable nodes can be considered as camera poses and 3D landmarks, with factors that minimise the re-projection error. BA applications use sensor information from odometry in mobile robots or an IMU to further improve the accuracy of the estimated robot trajectory. In Figure 2, the loop-closure factors can be extended to higher-level entities that impose more sophisticated constraints and factors. Loop-closure can often improve the consistency of the mapping results [33], as such factors act as additional constraints during the factorisation of the joint distribution [44].
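The re-projection error minimised by such BA factors can be spelled out as follows: a 3D landmark is projected through a pinhole camera at a candidate pose and compared with the observed pixel. The intrinsics, pose, and measurements below are made-up values for illustration.

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],    # assumed pinhole intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def reprojection_error(R, t, landmark, observed_px):
    """Squared pixel error of one (camera pose, landmark, observation) factor."""
    p_cam = R @ landmark + t          # world point expressed in the camera frame
    uv_h = K @ p_cam                  # homogeneous pixel coordinates
    uv = uv_h[:2] / uv_h[2]           # perspective division
    return float(np.sum((uv - observed_px) ** 2))

R, t = np.eye(3), np.zeros(3)                 # camera aligned with the world frame
landmark = np.array([0.2, -0.1, 4.0])         # a point 4 m in front of the camera
observed = np.array([344.0, 228.0])           # the pixel where the feature was seen
print(reprojection_error(R, t, landmark, observed))
```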

3. Evolution of SLAM Techniques and Paradigms

Various approaches to SLAM have been proposed throughout the years to address different challenges. The following discusses the operational process of traditional SLAM algorithms and recent developments in SLAM paradigms. A summary of their algorithmic approaches, advantages, and shortcomings is provided as follows.
  • Fast-SLAM (2002) [45] addresses the localisation problem by using a decomposition strategy for recursively estimating the full posterior distribution over robot pose and landmark locations. The algorithm performs an exact factorisation of the posterior into a conditional landmark distribution and a distribution over robot paths. Advantage: The complexity scales logarithmically with the number of landmarks on the map. Disadvantage: FastSLAM behaves like a non-optimal local search algorithm: it can produce consistent uncertainty estimates in the short term, but, in the long term, it is unable to explore the state space adequately as a Bayesian estimator.
  • Extended Kalman Filter (EKF) SLAM (2007) [46] uses a divide-and-conquer approach to obtain a consistent state estimate by using the state covariance to represent the real error in the estimation process. EKF SLAM uses phased iterations of prediction, observation, and update to perform state estimation in a Bayesian manner. Advantage: EKF SLAM often achieves a more consistent estimation than other approaches as it computes the exact solution rather than using approximation, and the proposed approach tackles the combinatorial complexity. Disadvantage: Despite the consistent estimation, the approach uses a probabilistic inference approach for forecasting the current state, which might diverge from the actual current state.
  • V-SLAM (2011) [47] computes locally dense stereo correspondences from the potentially sparse raw representation. The dense representation avoids the sparsity problem that often arises when operating SLAM with a sparse set of landmarks. Advantage: The computational overhead is relatively small, and the dense representation increases the robustness of the inner SLAM algorithm in a sparse environment. Disadvantage: The dense representation can be more sensitive to flaws or changes in the environment.
  • Large-Scale Direct (LSD) SLAM (2015) [14] aligns images directly with photoconsistency of high-contrast pixels. LSD SLAM can concurrently estimate the depths at the pixels using static stereo and temporal multi-view stereo by utilising the camera motion. Advantage: The SLAM can operate directly at the pixel level rather than as a separate procedure for processing the captured images. Disadvantage: The procedure can be costly when computing the translational motion between frames.
  • ORB-SLAM2 (2017) [48] makes use of features from monocular, stereo, and RGB-D cameras, which greatly enhances the versatility of the method. The algorithm uses bundle adjustment to create a 3D environment by extracting features from different images and placing them in 3D. Advantage: ORB-SLAM2 is highly versatile and can perform sensor fusion to improve detection quality. The model includes loop-closure detection, keyframe selection, and per-frame localisation, which enhance its robustness. Disadvantage: High processing cost, which might be prohibitive for small systems.
  • 2D-LiDAR SLAM (2018) [49] is an algorithmic approach that uses a laser sensor to create a 2D view of its surroundings. The method uses laser and visual fusion to provide localisation by combining two kinds of laser-based SLAM and monocular camera-based SLAM. Advantage: The fusion allows high performance in spotting complex structures such as hollow ceilings and can achieve high precision even at a significant distance range. Disadvantage: The 2D-LiDAR-based approach is highly sensitive to visibility conditions and performs poorly during poor weather conditions.
  • GRAPH-SLAM (2019) [50] utilises a stochastic gradient descent approach for nonlinear optimisation. GRAPH-SLAM uses radar sensors to perform point matching with ICP. Advantage: The approach uses the higher range and angular resolutions of radar for performing SLAM over long tracks. Disadvantage: GRAPH-SLAM can be sensitive to the choice of parameters and requires fine-tuning.
  • Particle Filter SLAM (2020) [51] uses a Monte Carlo sequential filtering method for maintaining an estimated distribution of the current robot state; a minimal resampling sketch is given after this list. Advantage: The filtering process is performed with state identification, mass modification, and a resampling procedure. Disadvantage: It requires many particles to perform state estimation in an environment with a large spatial area; otherwise, the particles become spatially separated and fail to cover the likelihood.
  • Direct Sparse Mapping (DSM) (2020) [52] adopts the photometric bundle adjustment (PBA) method for SLAM, which has been shown to be effective for estimating scene geometry and camera motion in Visual Odometry (VO). Unlike PBA, which estimates the camera odometry with a temporary map, DSM can build a persistent map for SLAM usage. Advantage: DSM is a direct monocular VSLAM method that detects point observations and extracts the geometric information from the photometric formulation. Disadvantage: PBA is needed during the DSM procedure, significantly increasing the runtime processing cost.
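As referenced in the Particle Filter SLAM entry above, the following is a minimal sketch of one predict-weight-resample step for a 1D robot pose with a known landmark; the motion model, measurement, and noise values are illustrative assumptions rather than the cited algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
particles = rng.normal(0.0, 1.0, 500)         # initial pose hypotheses

def pf_step(particles, control, measurement, landmark=5.0,
            motion_noise=0.1, obs_noise=0.3):
    # Predict: apply the control with additive motion noise.
    particles = particles + control + rng.normal(0.0, motion_noise, len(particles))
    # Weight: likelihood of the range measurement to the known landmark.
    expected = landmark - particles
    weights = np.exp(-0.5 * ((measurement - expected) / obs_noise) ** 2)
    weights /= weights.sum()
    # Resample: draw particles proportionally to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

particles = pf_step(particles, control=1.0, measurement=3.8)
print("estimated pose:", particles.mean())
```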

4. Visual-SLAM

Visual-SLAM with camera-based sensors has been a main research direction for SLAM solutions due to the large amount of information these sensors collect and their measurement range for mapping. The principle of Visual-SLAM lies in the sequential estimation of the camera motion based on the perceived movements of pixels in the image sequence. Besides robotics, Visual-SLAM is also essential for many vision-based applications such as virtual and augmented reality. Many existing Visual-SLAM methods explicitly model camera projections, motions, and environments based on visual geometry. Recently, many methods have assigned and incorporated semantic meaning to the observed objects to provide localisation that is more robust against observation noise and dynamic objects. In this section, we review the different families of algorithms within the branch of Visual-SLAM.

4.1. Feature-Based and Direct SLAM

Feature-based SLAM can be divided into filter-based and bundle-adjustment-based methods, both introduced in previous sections. Earlier SLAM approaches utilised EKFs for estimating the robot pose while simultaneously updating the landmarks observed by the robot [53,54,55]. However, the computational complexity of these methods increased with the number of landmarks, and they did not efficiently handle non-linearities in the measurements [56]. FastSLAM was proposed to improve on EKF-SLAM by combining particle filters with EKFs for landmark estimation [45]. However, it also suffered from sample degeneracy when sampling the proposal distribution. Parallel Tracking and Mapping [57] was proposed to address the issue by splitting the pose and map estimation into separate threads, which enhances real-time performance [58,59].
A place recognition system with ORB features was first proposed in [60], developed based on Bag-of-Words (BoW). ORB is a rotation-invariant and scale-aware feature [61] that can be extracted at a high frequency. Place recognition algorithms can often be highly efficient and run in real time. The algorithm is helpful for relocalisation and loop-closure in Visual-SLAM, and it was further developed with monocular cameras for operating in large-scale environments [62].
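A minimal front-end of this kind can be sketched with OpenCV's ORB implementation: detect keypoints, compute binary descriptors, and match them between two frames using the Hamming distance. The random images below merely stand in for real camera frames.

```python
import cv2
import numpy as np

frame_a = np.random.randint(0, 255, (480, 640), dtype=np.uint8)   # placeholder frames
frame_b = np.random.randint(0, 255, (480, 640), dtype=np.uint8)

orb = cv2.ORB_create(nfeatures=1000)                # ORB detector and descriptor
kp_a, des_a = orb.detectAndCompute(frame_a, None)
kp_b, des_b = orb.detectAndCompute(frame_b, None)

if des_a is not None and des_b is not None:
    # Hamming distance is appropriate for ORB's binary descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)
    print(f"{len(matches)} cross-checked matches between the two frames")
```

In a BoW-based place recognition pipeline, such descriptors would be quantised against a pre-trained vocabulary rather than matched pairwise.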
RGB-D SLAM [62] is another feature-based SLAM that uses feature points for generating dense and accurate 3D maps. Several models have been proposed to utilise the active camera sensor to develop a 6-DOF motion tracking model capable of 3D reconstruction, achieving impressive performance even under challenging scenarios [63,64]. In contrast to low-level point features, high-level objects often provide a more accurate tracking performance. For example, using a planar SLAM system, we can detect planes in the environment to yield a planar map while detecting objects such as desks and chairs for localisation [65]. The recognition of the objects, however, requires an offline supervised-learning procedure before executing the SLAM procedure.
Direct SLAM refers to methods that directly use the input images without any feature detector or descriptor. In contrast to feature-based methods, these feature-less approaches generally use photometric consistency to register two successive images. Using deep-learning models for extracting the environment's feature representation is promising in numerous robotic domains [66,67]. For example, DTAM [68], LSD-SLAM [69] and SVO [70] are some of the models that have gained considerable success. DSO models [71,72] have also been shown to be capable of using a bundle adjustment pipeline of temporal multi-view stereo for achieving high accuracy in a real-time system. In addition, models such as CodeSLAM [73] and CNN-SLAM [74] use deep-learning approaches for extracting dense representations of the environment for performing direct SLAM. However, direct SLAM is often more time-consuming than feature-based SLAM since it operates directly on the image space.
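The photometric residual that these direct methods minimise can be sketched as follows: a pixel from a reference image is back-projected with its depth, transformed by a candidate relative pose, re-projected into the current image, and the intensities are compared. The intrinsics, pose, depth, and images are illustrative assumptions.

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

def photometric_residual(img_ref, img_cur, u, v, depth, R, t):
    """Intensity difference for one high-gradient pixel (u, v) of the reference image."""
    p_ref = depth * (K_inv @ np.array([u, v, 1.0]))    # back-project to 3D
    p_cur = R @ p_ref + t                              # move into the current frame
    uv_h = K @ p_cur                                   # project into the current image
    u2, v2 = (uv_h[:2] / uv_h[2]).round().astype(int)  # nearest-neighbour lookup
    if not (0 <= v2 < img_cur.shape[0] and 0 <= u2 < img_cur.shape[1]):
        return None                                    # warped outside the image
    return float(img_ref[v, u]) - float(img_cur[v2, u2])

img_ref = np.random.randint(0, 255, (480, 640)).astype(np.float32)
img_cur = np.random.randint(0, 255, (480, 640)).astype(np.float32)
print(photometric_residual(img_ref, img_cur, u=300, v=200, depth=2.5,
                           R=np.eye(3), t=np.array([0.05, 0.0, 0.0])))
```

Direct methods sum such residuals over many high-gradient pixels and optimise the pose (and possibly the depths) to minimise the total photometric error.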

4.2. Localisation with Scene Modelling

Deep learning plays an essential role in scene understanding by utilising a range of information in techniques such as CNN classification. CNNs can be utilised over RGB images for extracting semantic information such as detecting scenes or pedestrians within the images [75,76,77]. CNNs can also operate directly on point cloud information captured from range-based sensors such as LiDAR. Models such as PointNet [78] in Figure 3 can classify objects based purely on point clouds. For example, PointNet++ [78], TangentConvolutions [79], DOPS [80], and RandLA-Net [81] are some of the recent deep learning models that can perform semantic understanding using large-scale point clouds. Most models are trained on some point cloud dataset that enables the model to infer object and scene information based purely on the geometric orientations of the input points.
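The core idea behind PointNet-style models can be shown in a few lines: a shared per-point MLP followed by a symmetric max-pooling, which makes the prediction invariant to the ordering of the input points. The layer sizes and class count below are assumptions and not the published architecture.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(        # applied independently to every point
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, points):                    # points: (batch, 3, num_points)
        features = self.point_mlp(points)         # per-point features (batch, 256, N)
        global_feat = features.max(dim=2).values  # order-invariant pooling over points
        return self.classifier(global_feat)       # per-cloud class logits

cloud = torch.randn(4, 3, 1024)                   # a batch of 4 clouds of 1024 points
print(TinyPointNet()(cloud).shape)                # torch.Size([4, 10])
```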
Dynamic objects can introduce difficulties in SLAM, particularly during loop-closure. SLAM can tackle this difficulty by utilising semantic information to filter dynamic objects from the input images [82]. Using the scene understanding module, we can filter out moving objects from the images to prevent the SLAM algorithm from conditioning on dynamic objects. For example, the SUMA++ model illustrated on the right of Figure 4 can obtain a semantic understanding of each detected object to filter out dynamic objects such as pedestrians and other moving vehicles. However, this increased robustness comes at the cost of neglecting parts of the perceived information, which can lower the accuracy of the estimated robot pose.
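A sketch of this filtering step: given a per-pixel semantic label map, keypoints that fall on classes considered dynamic are simply discarded before data association. The class ids and label map below are assumptions for illustration.

```python
import numpy as np

DYNAMIC_CLASSES = {11, 13}                  # e.g., assumed ids for person and car

def static_keypoints(keypoints, semantic_map):
    """Keep only keypoints whose pixels are not labelled as dynamic classes."""
    return [(u, v) for (u, v) in keypoints
            if semantic_map[v, u] not in DYNAMIC_CLASSES]

semantic_map = np.zeros((480, 640), dtype=np.int32)
semantic_map[100:300, 200:400] = 11         # a region segmented as a pedestrian
keypoints = [(250, 150), (500, 50), (390, 299), (10, 470)]
print(static_keypoints(keypoints, semantic_map))   # only keypoints on static structure
```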

4.3. Scene Modelling with Typological Relationship and Dynamic Models

Scene graphs are a different approach to building a model of the environment that includes the metric, semantic, and primary topological relationships between the scene objects and the overall environment [83]. Scene graphs can construct an environmental graph that spans an entire building, including the objects, materials, and rooms within the building [84]. The main disadvantage of scene graphs is that they need to be computed offline: previous approaches rely on registering RGB images with a known 3D mesh of the building to generate the 3D scene graph, which limits their applicability to static environments. Figure 5 illustrates one approach, Dynamic Scene Graphs (DSG) [85], that can also include dynamic elements within the environment. For example, DSG can model humans navigating within the building. The original DSG approach needs to be built offline, but an extension has been proposed [85] that is capable of building a 3D dynamic DSG from visual-inertial data in a fully automatic manner. The approach first builds a 3D mesh-based semantic map that is fed to the dynamic scene generator.
In addition, we can perform reasoning on the current situation by projecting what will likely happen based on previous events [87]. This class of methods relies on predicting the possible future state of the robot by conditioning on the current belief of our robot state and the robot’s dynamic model [88]. In addition, dynamic models can be incorporated into the objects in the surrounding environment, such as pedestrians and vehicles, for the model to recognise the predicted future pose of the nearby objects with some amount of uncertainty [89].

4.4. Semantic Understanding with Segmentation

Pixel-wise semantic segmentation is another promising direction for semantic understanding in SLAM. FCN [75] is a fully convolutional neural network for pixel-wise segmentation that originates from the computer vision community. ParseNet [90] derived a similar CNN architecture and injected global context information through the global pooling layers of FCN. The global context information allows the model to achieve better scene segmentation with a more feature-rich representation of the network. SegNet [91] is another network that uses an encoder-decoder architecture for segmentation; the decoder helps upsample the low-resolution features captured from the images. Bayesian approaches are helpful in many learning-based robotics applications [92,93]. Bayesian SegNet [92] took a probabilistic approach by using dropout layers in the original SegNet for sampling. The Bayesian approach estimates the probability for pixel-level segmentation, which often outperforms the original approach. Conditional Random Fields have also been combined with CNN architectures [94], formulating mean-field approximate inference as a Recurrent Neural Network.
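The Monte Carlo dropout idea behind Bayesian SegNet can be sketched as follows: dropout is kept active at test time, several stochastic forward passes are averaged to obtain per-pixel class probabilities, and their variance gives an uncertainty map. The tiny network is an illustrative stand-in, not the actual SegNet architecture.

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(p=0.5),                 # the source of stochasticity
            nn.Conv2d(16, num_classes, 1),
        )

    def forward(self, x):
        return self.net(x)

model = TinySegNet()
model.train()                                    # keep dropout enabled while sampling
image = torch.randn(1, 3, 64, 64)

with torch.no_grad():
    samples = torch.stack([model(image).softmax(dim=1) for _ in range(20)])

mean_prob = samples.mean(dim=0)                  # predictive mean (1, classes, H, W)
uncertainty = samples.var(dim=0).sum(dim=1)      # per-pixel predictive variance (1, H, W)
print(mean_prob.shape, uncertainty.shape)
```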
Semantic information is particularly valuable in environments where a robot needs to interact with humans [95]. The progress in deep-learning-based semantic segmentation in computer vision has been instrumental in pushing the research progress in semantic SLAM. By combining model-based SLAM methods with spatio-temporal CNN-based semantic segmentation [96], we can often provide the SLAM model with a more informative feature representation for localisation. The proposed system can simultaneously perform 3D semantic scene mapping and 6-DOF localisation even in a large indoor environment. The pixel-voxel network [97] is another similar approach that uses a CNN-like architecture for semantic mapping. SemanticFusion [98] integrates CNN-based semantic segmentation with the dense SLAM technology ElasticFusion [99], resulting in a model that produces a dense semantic map and performs well in an indoor environment.
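The per-element label fusion used by SemanticFusion-style pipelines amounts to a recursive Bayesian update: each map element keeps a class distribution that is multiplied by the CNN's per-frame class probabilities and renormalised whenever it is observed again. The class count and probabilities below are assumptions for illustration.

```python
import numpy as np

def fuse(label_dist, frame_prob):
    """Bayesian update of one map element's class distribution with a new observation."""
    updated = label_dist * frame_prob
    return updated / updated.sum()

surfel = np.full(4, 0.25)                          # uniform prior over 4 classes
frames = [np.array([0.6, 0.2, 0.1, 0.1]),          # CNN outputs from three frames
          np.array([0.5, 0.3, 0.1, 0.1]),
          np.array([0.7, 0.1, 0.1, 0.1])]

for prob in frames:
    surfel = fuse(surfel, prob)
print("fused class distribution:", surfel)          # sharpens towards class 0
```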

4.5. Sensors Fusions for Semantic Scene Understanding

With the recent advancements in deep learning, numerous Visual-SLAM systems have also gained tremendous success in using learned models for semantic understanding with data fusion. Models such as Frustum PointNets [100] utilise both RGB cameras and LiDAR sensors to improve the accuracy of understanding the semantics of the scene. Figure 6 illustrates how Frustum PointNets utilises information from both sensors for data fusion, where a PointNet is first applied for object instance segmentation and amodal bounding box regression. Sensor fusion provides a richer feature representation for performing data association. For example, VINet [101] is a sensor fusion network that can combine the estimated pose from DeepVO [102] with inertial sensor readings using an LSTM. During the model training procedure, the prediction and the fusion networks are trained jointly to allow the gradient to pass through the entire network. Therefore, both networks can compensate for each other, and the fusion system achieves high performance compared to traditional sensor fusion methods. The same methodology can also be used as a fusion system [103] that is capable of fusing the 6-DOF pose data from cameras and magnetic sensors [104].
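A simplified fusion block in the spirit of VINet is sketched below: per-frame visual features and a window of IMU readings are encoded separately, concatenated, and passed through an LSTM that regresses a 6-DOF pose increment, so gradients flow through both branches during joint training. All dimensions are assumptions and do not reproduce the published architecture.

```python
import torch
import torch.nn as nn

class VisualInertialFusion(nn.Module):
    def __init__(self, visual_dim=256, imu_dim=6, hidden=128):
        super().__init__()
        self.imu_encoder = nn.LSTM(imu_dim, 64, batch_first=True)   # encodes each IMU window
        self.fusion_lstm = nn.LSTM(visual_dim + 64, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, 6)        # translation (3) + rotation (3)

    def forward(self, visual_feats, imu_seq):
        # visual_feats: (batch, T, visual_dim); imu_seq: (batch, T, window, 6)
        b, t, w, d = imu_seq.shape
        _, (h, _) = self.imu_encoder(imu_seq.reshape(b * t, w, d))
        imu_feats = h[-1].reshape(b, t, 64)          # last hidden state of each window
        fused, _ = self.fusion_lstm(torch.cat([visual_feats, imu_feats], dim=-1))
        return self.pose_head(fused)                 # (batch, T, 6) pose increments

model = VisualInertialFusion()
poses = model(torch.randn(2, 10, 256), torch.randn(2, 10, 20, 6))
print(poses.shape)                                   # torch.Size([2, 10, 6])
```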
The information obtained from a camera can also be fused with GPS, INS, and wheel odometry readings in an ego-motion estimation system [105]. The model essentially uses deep learning to capture the temporal motion dynamics. The motion from the camera is utilised in a mixture density network to construct an optical flow vector for better estimation. Direct methods for visual odometry (VO) can often exploit intensity-level information gathered from the input images. However, these methods cannot guarantee optimality compared to feature-based methods. Semi-direct VO (SVO2) [106] is a hybrid method that uses direct methods to track pixels while relying on feature-based methods for the joint optimisation of structure and motion. The hybrid method takes advantage of both approaches to improve the robustness of VO. Similar approaches such as VINS-Fusion [107] are capable of fusing an IMU with monocular visual input for estimating odometry with high reliability. Deep neural networks can further learn rigid-body motion in a CNN architecture [108] using raw point cloud data as input to predict the SE(3) rigid transformation of the robot.

5. Conclusions and Future Directions

Numerous studies have been conducted in the SLAM domain, as mapping and navigation are critical for enabling robots to interact autonomously with the real world. SLAM algorithms remain a promising and exciting research domain due to their ubiquitous need in mobile robotic applications. SLAM merges ideas from multiple fields that bridge communities within the broader robotic system, for example, sensing, perception, localisation, and mapping. In addition, visual SLAM systems with learning capabilities have shown tremendous potential for further exploration. Approaches with deep learning have been shown to be more flexible and more robust by utilising the semantic information about the surrounding objects. Sensor information such as pose, depth, 3D point clouds, and semantic mapping of the surrounding objects has been shown to be highly useful in Visual-SLAM. By fusing the measured readings from different sensors, learning-based models can utilise more sources of information for a more feature-rich data-association process. We believe a learning-based approach to semantic SLAM is a promising and exciting direction for developing autonomous robots.
SLAM provides the foundations for the autonomous operation of robots. Many possible future directions can further address the challenges discussed in earlier sections. For example, data association is one of the core problems in SLAM. Some current IMU and visual odometry-based approaches depend highly on the sensors' accuracy or assume a prior of normally distributed and stationary noise. Having an adaptive approach to tackling possibly shifting temporal noise distributions can further mitigate the data association problem. Sensor fusion should be another focus in Visual-SLAM due to the availability of various sensors in modern robots. LiDAR and RGB-D cameras are the two most popular approaches in modern SLAM; therefore, combining the rich information provided by these sensors can further improve the current state-of-the-art Visual-SLAM algorithms. Currently, the SLAM and motion planning problems in robotics are typically tackled in a disjointed manner. However, integrating the uncertainty and probabilistic information obtained in the SLAM framework would theoretically provide more information for the robot to plan its next movement during motion planning. Therefore, integrating motion planning algorithms such as RRT or PRM within SLAM could provide a more robust robotic framework. Finally, several works that depend on deep neural networks have been discussed in previous sections. Integrating methodologies from the deep reinforcement learning literature can perhaps provide SLAM with a learnable policy that exploits past SLAM episodes to improve future execution in unseen environments.
We provide a thorough literature review of the fundamental and current state-of-the-art Visual-SLAM models for communicating our current understanding of SLAM approaches. We have shown ongoing evolution in Visual-SLAM, from model-based approaches to deep learning-based methods. Most current SLAM models seek to improve their accuracy and robustness in the high-level cognition and perception within the Visual-SLAM systems. The key to most current semantic Visual-SLAM models lies in designing the network architecture, appropriate loss function, and the data representation of deep learning-based methods. Therefore, the ongoing research progress in deep-learning models will further enhance the capability of Visual-SLAM models.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Borenstein, J.; Everett, H.R.; Feng, L.; Wehe, D. Mobile Robot Positioning: Sensors and Techniques. J. Robot. Syst. 1997, 14, 231–249. [Google Scholar] [CrossRef]
  2. Kolhatkar, C.; Wagle, K. Review of SLAM Algorithms for Indoor Mobile Robot with LIDAR and RGB-D Camera Technology. In Innovations in Electrical and Electronic Engineering; Springer: Berlin/Heidelberg, Germany, 2021; pp. 397–409. [Google Scholar] [CrossRef]
  3. Lai, T.; Ramos, F. Adaptively Exploits Local Structure With Generalised Multi-Trees Motion Planning. IEEE Robot. Autom. Lett. 2021, 7, 1111–1117. [Google Scholar] [CrossRef]
  4. Garrido, S.; Moreno, L.; Abderrahim, M.; Martin, F. Path Planning for Mobile Robot Navigation Using Voronoi Diagram and Fast Marching. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, 9–15 October 2006; pp. 2376–2381. [Google Scholar] [CrossRef]
  5. Lai, T.; Ramos, F.; Francis, G. Balancing Global Exploration and Local-connectivity Exploitation with Rapidly-exploring Random Disjointed-Trees. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019. [Google Scholar] [CrossRef]
  6. Katrakazas, C.; Quddus, M.; Chen, W.H.; Deka, L. Real-Time Motion Planning Methods for Autonomous on-Road Driving: State-of-the-art and Future Research Directions. Transp. Res. Part C Emerg. Technol. 2015, 60, 416–442. [Google Scholar] [CrossRef]
  7. Flint, A.; Mei, C.; Reid, I.; Murray, D. Growing Semantically Meaningful Models for Visual Slam. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 467–474. [Google Scholar] [CrossRef]
  8. Lothe, P.; Bourgeois, S.; Dekeyser, F.; Royer, E.; Dhome, M. Towards Geographical Referencing of Monocular Slam Reconstruction Using 3d City Models: Application to Real-Time Accurate Vision-Based Localization. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2882–2889. [Google Scholar] [CrossRef]
  9. Weingarten, J.; Siegwart, R. EKF-based 3D SLAM for Structured Environment Reconstruction. In Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmonton, AB, Canada, 2–6 August 2005; pp. 3834–3839. [Google Scholar] [CrossRef]
  10. Chong, T.; Tang, X.; Leng, C.; Yogeswaran, M.; Ng, O.; Chong, Y. Sensor Technologies and Simultaneous Localization and Mapping (SLAM). Procedia Comput. Sci. 2015, 76, 174–179. [Google Scholar] [CrossRef] [Green Version]
  11. Hong, S.; Bangunharcana, A.; Park, J.M.; Choi, M.; Shin, H.S. Visual SLAM-based Robotic Mapping Method for Planetary Construction. Sensors 2021, 21, 7715. [Google Scholar] [CrossRef]
  12. Bavle, H.; Sanchez-Lopez, J.L.; Schmidt, E.F.; Voos, H. From SLAM to Situational Awareness: Challenges and Survey. arXiv 2021, arXiv:2110.00273. [Google Scholar]
  13. Hess, W.; Kohler, D.; Rapp, H.; Andor, D. Real-Time Loop Closure in 2D LIDAR SLAM. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1271–1278. [Google Scholar] [CrossRef]
  14. Engel, J.; Stückler, J.; Cremers, D. Large-Scale Direct SLAM with Stereo Cameras. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 1935–1942. [Google Scholar] [CrossRef]
  15. Pang, C.; Tan, Y.; Li, S.; Li, Y.; Ji, B.; Song, R. Low-Cost and High-Accuracy LiDAR SLAM for Large Outdoor Scenarios. In Proceedings of the 2019 IEEE International Conference on Real-time Computing and Robotics (RCAR), Irkutsk, Russia, 4–9 August 2019; pp. 868–873. [Google Scholar] [CrossRef]
  16. Zhu, Y.; Zheng, C.; Yuan, C.; Huang, X.; Hong, X. Camvox: A Low-Cost and Accurate Lidar-Assisted Visual Slam System. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 5049–5055. [Google Scholar] [CrossRef]
  17. Xiong, J.; Liu, Y.; Ye, X.; Han, L.; Qian, H.; Xu, Y. A Hybrid Lidar-Based Indoor Navigation System Enhanced by Ceiling Visual Codes for Mobile Robots. In Proceedings of the 2016 IEEE International Conference on Robotics and Biomimetics (ROBIO), Qingdao, China, 3–7 December 2016; pp. 1715–1720. [Google Scholar] [CrossRef]
  18. Moleski, T.W.; Wilhelm, J. Trilateration Positioning Using Hybrid Camera-LiDAR System. In Proceedings of the AIAA Scitech 2020 Forum, Orlando, FL, USA, 6–10 January 2020; p. 393. [Google Scholar] [CrossRef]
  19. Su, Z.; Zhou, X.; Cheng, T.; Zhang, H.; Xu, B.; Chen, W. Global Localization of a Mobile Robot Using Lidar and Visual Features. In Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO), Macau, 5–8 December 2017; pp. 2377–2383. [Google Scholar] [CrossRef]
  20. Yang, S.; Song, Y.; Kaess, M.; Scherer, S. Pop-up Slam: Semantic Monocular Plane Slam for Low-Texture Environments. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 1222–1229. [Google Scholar] [CrossRef]
  21. Gonzalez, M.; Marchand, E.; Kacete, A.; Royan, J. S3lam: Structured Scene Slam. arXiv 2021, arXiv:2109.07339. [Google Scholar]
  22. Liao, Z.; Hu, Y.; Zhang, J.; Qi, X.; Zhang, X.; Wang, W. SO-SLAM: Semantic Object SLAM with Scale Proportional and Symmetrical Texture Constraints. IEEE Robot. Autom. Lett. 2022, 7, 4008–4015. [Google Scholar] [CrossRef]
  23. Feder, H.J.S.; Leonard, J.J.; Smith, C.M. Adaptive Mobile Robot Navigation and Mapping. Int. J. Robot. Res. 1999, 18, 650–668. [Google Scholar] [CrossRef]
  24. Chatila, R.; Laumond, J.P. Position Referencing and Consistent World Modeling for Mobile Robots. In Proceedings of the 1985 IEEE International Conference on Robotics and Automation, St. Louis, MO, USA, 25–28 March 1985; Volume 2, pp. 138–145. [Google Scholar] [CrossRef]
  25. Frese, U. A Discussion of Simultaneous Localization and Mapping. Auton. Robot. 2006, 20, 25–42. [Google Scholar] [CrossRef]
  26. Welch, G.; Bishop, G. An Introduction to the Kalman Filter; University of North Carolina at Chapel Hill: Chapel Hill, NC, USA, 1995; Volume 7. [Google Scholar]
  27. Ribeiro, M.I. Kalman and Extended Kalman Filters: Concept, Derivation and Properties. Inst. Syst. Robot. 2004, 43, 46. [Google Scholar]
  28. Carpenter, J.; Clifford, P.; Fearnhead, P. Improved Particle Filter for Nonlinear Problems. IEE Proc.-Radar, Sonar Navig. 1999, 146, 2–7. [Google Scholar] [CrossRef]
  29. Sim, R.; Roy, N. Global A-optimal Robot Exploration in Slam. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, 18–22 April 2005; pp. 661–666. [Google Scholar] [CrossRef]
  30. Bryson, M.; Sukkarieh, S. Observability Analysis and Active Control for Airborne SLAM. IEEE Trans. Aerosp. Electron. Syst. 2008, 44, 261–280. [Google Scholar] [CrossRef]
  31. Carrillo, H.; Reid, I.; Castellanos, J.A. On the Comparison of Uncertainty Criteria for Active SLAM. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA, 14–18 May 2012; pp. 2080–2087. [Google Scholar] [CrossRef]
  32. Lenac, K.; Kitanov, A.; Maurović, I.; Dakulović, M.; Petrović, I. Fast Active SLAM for Accurate and Complete Coverage Mapping of Unknown Environments. In Intelligent Autonomous Systems 13; Springer: Berlin/Heidelberg, Germany, 2016; pp. 415–428. [Google Scholar]
  33. Stachniss, C.; Hahnel, D.; Burgard, W. Exploration with Active Loop-Closing for FastSLAM. In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), Sendai, Japan, 28 September–2 October 2004; Volume 2, pp. 1505–1510. [Google Scholar]
  34. Cummins, M.; Newman, P. FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance. Int. J. Robot. Res. 2008, 27, 647–665. [Google Scholar] [CrossRef]
  35. Mei, C.; Sibley, G.; Cummins, M.; Newman, P.; Reid, I. A Constant-Time Efficient Stereo Slam System. In Proceedings of the British Machine Vision Conference; BMVA Press: Surrey, UK, 2009; Volume 1, Available online: http://www.bmva.org/bmvc/2009/Papers/Paper056/Paper056.pdf (accessed on 18 September 2022).
  36. Dellaert, F.; Kaess, M. Factor Graphs for Robot Perception. Found. Trends® Robot. 2017, 6, 1–139. [Google Scholar] [CrossRef]
  37. Kschischang, F.R.; Frey, B.J.; Loeliger, H.A. Factor Graphs and the Sum-Product Algorithm. IEEE Trans. Inf. Theory 2001, 47, 498–519. [Google Scholar] [CrossRef] [Green Version]
  38. Kaess, M.; Johannsson, H.; Roberts, R.; Ila, V.; Leonard, J.; Dellaert, F. iSAM2: Incremental Smoothing and Mapping with Fluid Relinearization and Incremental Variable Reordering. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3281–3288. [Google Scholar] [CrossRef]
  39. Folkesson, J.; Christensen, H.I. Graphical SLAM for Outdoor Applications. J. Field Robot. 2007, 24, 51–70. [Google Scholar] [CrossRef]
  40. Olson, E.; Leonard, J.; Teller, S. Fast Iterative Alignment of Pose Graphs with Poor Initial Estimates. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation, ICRA 2006, Orlando, FL, USA, 15–19 May 2006; pp. 2262–2269. [Google Scholar] [CrossRef]
  41. Thrun, S.; Montemerlo, M. The Graph SLAM Algorithm with Applications to Large-Scale Mapping of Urban Structures. Int. J. Robot. Res. 2006, 25, 403–429. [Google Scholar] [CrossRef]
  42. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  43. Trevor, A.J.; Rogers, J.G.; Christensen, H.I. Omnimapper: A Modular Multimodal Mapping Framework. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 1983–1990. [Google Scholar] [CrossRef]
  44. Sünderhauf, N.; Protzel, P. Switchable Constraints for Robust Pose Graph SLAM. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 1879–1884. [Google Scholar] [CrossRef]
  45. Montemerlo, M.; Thrun, S.; Koller, D.; Wegbreit, B. FastSLAM: A Factored Solution to the Simultaneous Localization and Mapping Problem. In Proceedings of the AAAI/IAAI, Edmonton, AB, Canada, 28 July–1 August 2002; pp. 593–598. Available online: https://www.aaai.org/Papers/AAAI/2002/AAAI02-089.pdf (accessed on 18 September 2022).
  46. Paz, L.M.; Jensfelt, P.; Tardos, J.D.; Neira, J. EKF SLAM Updates in O (n) with Divide and Conquer SLAM. In Proceedings of the 2007 IEEE International Conference on Robotics and Automation, Rome, Italy, 10–14 April 2007; pp. 1657–1663. [Google Scholar] [CrossRef]
  47. Lategahn, H.; Geiger, A.; Kitt, B. Visual SLAM for Autonomous Ground Vehicles. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1732–1737. [Google Scholar] [CrossRef]
  48. Mur-Artal, R.; Tardós, J.D. Orb-Slam2: An Open-Source Slam System for Monocular, Stereo, and Rgb-d Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  49. Chan, S.H.; Wu, P.T.; Fu, L.C. Robust 2D Indoor Localization through Laser SLAM and Visual SLAM Fusion. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; pp. 1263–1268. [Google Scholar] [CrossRef]
  50. Holder, M.; Hellwig, S.; Winner, H. Real-Time Pose Graph SLAM Based on Radar. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 1145–1151. [Google Scholar] [CrossRef]
  51. Chen, Q.M.; Dong, C.Y.; Mu, Y.Z.; Li, B.C.; Fan, Z.Q.; Wang, Q.L. An Improved Particle Filter SLAM Algorithm for AGVs. In Proceedings of the 2020 IEEE 6th International Conference on Control Science and Systems Engineering (ICCSSE), Beijing, China, 17–19 July 2020; pp. 27–31. [Google Scholar] [CrossRef]
  52. Zubizarreta, J.; Aguinaga, I.; Montiel, J.M.M. Direct Sparse Mapping. IEEE Trans. Robot. 2020, 36, 1363–1370. [Google Scholar] [CrossRef]
  53. Guivant, J.E.; Nebot, E.M. Optimization of the Simultaneous Localization and Map-Building Algorithm for Real-Time Implementation. IEEE Trans. Robot. Autom. 2001, 17, 242–257. [Google Scholar] [CrossRef]
  54. Leonard, J.J.; Feder, H.J.S. A Computationally Efficient Method for Large-Scale Concurrent Mapping and Localization. In Robotics Research; Springer: Berlin/Heidelberg, Germany, 2000; pp. 169–176. [Google Scholar]
  55. Lu, F.; Milios, E. Globally Consistent Range Scan Alignment for Environment Mapping. Auton. Robot. 1997, 4, 333–349. [Google Scholar] [CrossRef]
  56. Bailey, T.; Nieto, J.; Guivant, J.; Stevens, M.; Nebot, E. Consistency of the EKF-SLAM Algorithm. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, 9–15 October 2006; pp. 3562–3568. [Google Scholar] [CrossRef]
  57. Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234. [Google Scholar] [CrossRef]
  58. Castle, R.; Klein, G.; Murray, D.W. Video-Rate Localization in Multiple Maps for Wearable Augmented Reality. In Proceedings of the 2008 12th IEEE International Symposium on Wearable Computers, Pittsburgh, PA, USA, 28 September–1 October 2008; pp. 15–22. [Google Scholar] [CrossRef]
  59. Pradeep, V.; Rhemann, C.; Izadi, S.; Zach, C.; Bleyer, M.; Bathiche, S. MonoFusion: Real-time 3D Reconstruction of Small Scenes with a Single Web Camera. In Proceedings of the 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Adelaide, SA, Australia, 1–4 October 2013; pp. 83–88. [Google Scholar] [CrossRef]
  60. Mur-Artal, R.; Tardós, J.D. Fast Relocalisation and Loop Closing in Keyframe-Based SLAM. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 846–853. [Google Scholar] [CrossRef] [Green Version]
  61. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An Efficient Alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar] [CrossRef]
  62. Endres, F.; Hess, J.; Sturm, J.; Cremers, D.; Burgard, W. 3-D Mapping with an RGB-D Camera. IEEE Trans. Robot. 2013, 30, 177–187. [Google Scholar] [CrossRef]
  63. Kueng, B.; Mueggler, E.; Gallego, G.; Scaramuzza, D. Low-Latency Visual Odometry Using Event-Based Feature Tracks. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 16–23. [Google Scholar] [CrossRef]
  64. Kim, H.; Leutenegger, S.; Davison, A.J. Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 349–364. [Google Scholar]
  65. Salas-Moreno, R.F.; Newcombe, R.A.; Strasdat, H.; Kelly, P.H.; Davison, A.J. Slam++: Simultaneous Localisation and Mapping at the Level of Objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1352–1359. [Google Scholar]
  66. Hewing, L.; Wabersich, K.P.; Menner, M.; Zeilinger, M.N. Learning-Based Model Predictive Control: Toward Safe Learning in Control. Annu. Rev. Control. Robot. Auton. Syst. 2020, 3, 269–296. [Google Scholar] [CrossRef]
  67. Lai, T.; Ramos, F. Plannerflows: Learning Motion Samplers with Normalising Flows. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems, Prague, Czech Republic, 27 September–1 October 2021; pp. 2542–2548. [Google Scholar] [CrossRef]
  68. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense Tracking and Mapping in Real-Time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar] [CrossRef]
  69. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale Direct Monocular SLAM. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 834–849. [Google Scholar]
  70. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast Semi-Direct Monocular Visual Odometry. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22. [Google Scholar] [CrossRef]
  71. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef]
  72. Wang, R.; Schworer, M.; Cremers, D. Stereo DSO: Large-scale Direct Sparse Visual Odometry with Stereo Cameras. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3903–3911. [Google Scholar]
  73. Bloesch, M.; Czarnowski, J.; Clark, R.; Leutenegger, S.; Davison, A.J. CodeSLAM—Learning a Compact, Optimisable Representation for Dense Visual SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2560–2568. [Google Scholar]
74. Tateno, K.; Tombari, F.; Laina, I.; Navab, N. CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6243–6252. [Google Scholar]
  75. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  76. Geraldes, R.; Gonçalves, A.; Lai, T.; Villerabel, M.; Deng, W.; Salta, A.; Nakayama, K.; Matsuo, Y.; Prendinger, H. UAV-based Situational Awareness System Using Deep Learning. IEEE Access 2019, 7, 122583–122594. [Google Scholar] [CrossRef]
  77. Peng, C.; Zhang, K.; Ma, Y.; Ma, J. Cross Fusion Net: A Fast Semantic Segmentation Network for Small-Scale Semantic Information Capturing in Aerial Scenes. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
78. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
79. Tatarchenko, M.; Park, J.; Koltun, V.; Zhou, Q.Y. Tangent Convolutions for Dense Prediction in 3D. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3887–3896. [Google Scholar]
80. Najibi, M.; Lai, G.; Kundu, A.; Lu, Z.; Rathod, V.; Funkhouser, T.; Pantofaru, C.; Ross, D.; Davis, L.S.; Fathi, A. DOPS: Learning to Detect 3D Objects and Predict Their 3D Shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11913–11922. [Google Scholar]
81. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
82. Wang, Z.; Zhang, Q.; Li, J.; Zhang, S.; Liu, J. A Computationally Efficient Semantic SLAM Solution for Dynamic Scenes. Remote Sens. 2019, 11, 1363. [Google Scholar] [CrossRef]
83. Armeni, I.; He, Z.Y.; Gwak, J.; Zamir, A.R.; Fischer, M.; Malik, J.; Savarese, S. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 5664–5673. [Google Scholar]
84. Wald, J.; Dhamo, H.; Navab, N.; Tombari, F. Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3961–3970. [Google Scholar]
  85. Rosinol, A.; Violette, A.; Abate, M.; Hughes, N.; Chang, Y.; Shi, J.; Gupta, A.; Carlone, L. Kimera: From SLAM to Spatial Perception with 3D Dynamic Scene Graphs. Int. J. Robot. Res. 2021, 40, 1510–1546. [Google Scholar] [CrossRef]
86. Chen, X.; Milioto, A.; Palazzolo, E.; Giguere, P.; Behley, J.; Stachniss, C. SuMa++: Efficient LiDAR-Based Semantic SLAM. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 4530–4537. [Google Scholar] [CrossRef]
  87. Castillo-Lopez, M.; Ludivig, P.; Sajadi-Alamdari, S.A.; Sanchez-Lopez, J.L.; Olivares-Mendez, M.A.; Voos, H. A Real-Time Approach for Chance-Constrained Motion Planning with Dynamic Obstacles. IEEE Robot. Autom. Lett. 2020, 5, 3620–3625. [Google Scholar] [CrossRef]
  88. Sanchez-Lopez, J.L.; Arellano-Quintana, V.; Tognon, M.; Campoy, P.; Franchi, A. Visual Marker Based Multi-Sensor Fusion State Estimation. IFAC-PapersOnLine 2017, 50, 16003–16008. [Google Scholar] [CrossRef]
  89. Lefkopoulos, V.; Menner, M.; Domahidi, A.; Zeilinger, M.N. Interaction-Aware Motion Prediction for Autonomous Driving: A Multiple Model Kalman Filtering Scheme. IEEE Robot. Autom. Lett. 2020, 6, 80–87. [Google Scholar] [CrossRef]
90. Liu, W.; Rabinovich, A.; Berg, A.C. ParseNet: Looking Wider to See Better. arXiv 2015, arXiv:1506.04579. [Google Scholar]
91. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
92. Kendall, A.; Badrinarayanan, V.; Cipolla, R. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. arXiv 2015, arXiv:1511.02680. [Google Scholar]
  93. Lai, T.; Morere, P.; Ramos, F.; Francis, G. Bayesian Local Sampling-Based Planning. IEEE Robot. Autom. Lett. 2020, 5, 1954–1961. [Google Scholar] [CrossRef]
  94. Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H. Conditional Random Fields as Recurrent Neural Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1529–1537. [Google Scholar]
  95. Hülse, M.; McBride, S.; Lee, M. Fast Learning Mapping Schemes for Robotic Hand–Eye Coordination. Cogn. Comput. 2010, 2, 1–16. [Google Scholar] [CrossRef]
  96. Li, R.; Gu, D.; Liu, Q.; Long, Z.; Hu, H. Semantic Scene Mapping with Spatio-Temporal Deep Neural Network for Robotic Applications. Cogn. Comput. 2018, 10, 260–271. [Google Scholar] [CrossRef]
97. Zhao, C.; Sun, L.; Purkait, P.; Duckett, T.; Stolkin, R. Dense RGB-D Semantic Mapping with Pixel-Voxel Neural Network. Sensors 2018, 18, 3099. [Google Scholar] [CrossRef]
98. McCormac, J.; Handa, A.; Davison, A.; Leutenegger, S. SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4628–4635. [Google Scholar] [CrossRef]
  99. Whelan, T.; Leutenegger, S.; Salas-Moreno, R.; Glocker, B.; Davison, A. ElasticFusion: Dense SLAM without a pose graph. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 13–17 July 2015; Available online: https://spiral.imperial.ac.uk/bitstream/10044/1/23438/2/whelan2015rss.pdf (accessed on 18 September 2022).
100. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
101. Clark, R.; Wang, S.; Wen, H.; Markham, A.; Trigoni, N. VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar] [CrossRef]
102. Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2043–2050. [Google Scholar] [CrossRef]
  103. Turan, M.; Almalioglu, Y.; Gilbert, H.; Sari, A.E.; Soylu, U.; Sitti, M. Endo-VMFuseNet: Deep Visual-Magnetic Sensor Fusion Approach for Uncalibrated, Unsynchronized and Asymmetric Endoscopic Capsule Robot Localization Data. arXiv 2017, arXiv:1709.06041. [Google Scholar]
104. Turan, M.; Almalioglu, Y.; Gilbert, H.; Araujo, H.; Cemgil, T.; Sitti, M. EndoSensorFusion: Particle Filtering-Based Multi-Sensory Data Fusion with Switching State-Space Model for Endoscopic Capsule Robots. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 5393–5400. [Google Scholar] [CrossRef]
  105. Pillai, S.; Leonard, J.J. Towards Visual Ego-Motion Learning in Robots. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5533–5540. [Google Scholar] [CrossRef]
106. Forster, C.; Zhang, Z.; Gassner, M.; Werlberger, M.; Scaramuzza, D. SVO: Semidirect Visual Odometry for Monocular and Multicamera Systems. IEEE Trans. Robot. 2017, 33, 249–265. [Google Scholar] [CrossRef]
107. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
108. Byravan, A.; Fox, D. SE3-Nets: Learning Rigid Body Motion Using Deep Neural Networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 173–180. [Google Scholar] [CrossRef]
Figure 1. Formulating the visual-SLAM problem as a factor graph, where the camera poses are denoted as $X_i^w$ and the landmarks as $l_j$. The observations of the landmarks and the odometry measurements at the various camera poses are denoted as $z_k$ and $u_i$, respectively. The prior belief on the initial pose is denoted as $P_0$, and the joint probability distribution of the MAP problem factorises into the product of the depicted factors.
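For concreteness, the factorisation sketched in Figure 1 can be written out explicitly. Writing $X = \{X_i^w\}$, $L = \{l_j\}$, $Z = \{z_k\}$ and $U = \{u_i\}$ for the sets of poses, landmarks, observations and odometry measurements, a standard form of the joint posterior is

$$
P(X, L \mid Z, U) \;\propto\; P_0\!\left(X_0^w\right)\, \prod_{i=1}^{N} P\!\left(X_i^w \mid X_{i-1}^w, u_i\right)\, \prod_{k=1}^{K} P\!\left(z_k \mid X_{i_k}^w, l_{j_k}\right),
$$

where the indices $i_k$ and $j_k$ are introduced here only to indicate which pose and landmark the $k$-th observation involves. The MAP estimate is then the maximiser of this product,

$$
X^\ast, L^\ast \;=\; \operatorname*{arg\,max}_{X,\, L} \; P(X, L \mid Z, U).
$$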
Figure 2. Visual-SLAM Bundle Adjustment (BA) as a factor graph. The odometry factors $u_i$ constrain consecutive camera poses, while potential loop-closure factors $c_{i_1,i_2}$, where $i_1$ and $i_2$ are indices of camera poses, constrain poses hypothesised to observe the same place. The figure shows loop-closure factors $c_{1,N-1}$ between camera poses $X_1^w$ and $X_{N-1}^w$, and $c_{2,N}$ between $X_2^w$ and $X_N^w$, which are used to decide whether the mobile robot has returned to a previously visited area.
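Under the usual Gaussian-noise assumption, maximising the posterior above reduces to a nonlinear least-squares problem over the factors in Figure 2. One common way to write the BA objective, with $\pi(\cdot)$ denoting the camera projection of a landmark into an image, $h(\cdot)$ and $g(\cdot)$ the relative-motion models associated with the odometry and loop-closure factors, and $\Sigma_k$, $\Lambda_i$, $\Gamma_{i_1,i_2}$ the corresponding measurement covariances (all introduced here for illustration, following the caption's notation), is

$$
\min_{\{X_i^w\},\, \{l_j\}} \;\; \sum_{k} \left\lVert z_k - \pi\!\left(X_{i_k}^w, l_{j_k}\right) \right\rVert_{\Sigma_k}^{2}
\;+\; \sum_{i} \left\lVert h\!\left(X_{i-1}^w, X_i^w\right) - u_i \right\rVert_{\Lambda_i}^{2}
\;+\; \sum_{(i_1, i_2)} \left\lVert g\!\left(X_{i_1}^w, X_{i_2}^w\right) - c_{i_1, i_2} \right\rVert_{\Gamma_{i_1, i_2}}^{2},
$$

where $\lVert \cdot \rVert_{\Sigma}$ denotes the Mahalanobis norm and $c_{i_1,i_2}$ stands for the relative-pose measurement associated with the loop-closure factor.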
Figure 3. Example of using PointNet [78] for performing part segmentation directly on input point clouds.
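The key idea behind PointNet-style segmentation is a shared per-point MLP followed by a symmetric max-pooling that yields an order-invariant global feature, which is then concatenated back onto each point for per-point classification. The sketch below illustrates this idea in PyTorch; the layer widths loosely follow the original architecture but are purely illustrative, and the class and variable names are our own rather than part of any released codebase.

```python
# Minimal sketch of the PointNet segmentation idea (not the authors' reference
# implementation): a shared per-point MLP, a symmetric max-pool producing an
# order-invariant global feature, and a per-point head over [local || global].
import torch
import torch.nn as nn

class PointNetSegSketch(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        # Shared MLP applied independently to every point: (B, 3, N) -> (B, 64, N)
        self.local_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 64, 1), nn.ReLU(),
        )
        # Deeper shared MLP before the symmetric aggregation: (B, 64, N) -> (B, 1024, N)
        self.global_mlp = nn.Sequential(
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        # Per-point segmentation head on concatenated local and global features
        self.seg_head = nn.Sequential(
            nn.Conv1d(64 + 1024, 256, 1), nn.ReLU(),
            nn.Conv1d(256, num_classes, 1),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, 3, num_points)
        local_feat = self.local_mlp(points)                       # (B, 64, N)
        global_feat = self.global_mlp(local_feat).max(dim=2)[0]   # (B, 1024), order-invariant
        global_tiled = global_feat.unsqueeze(2).expand(-1, -1, points.shape[2])
        return self.seg_head(torch.cat([local_feat, global_tiled], dim=1))  # (B, C, N)

logits = PointNetSegSketch()(torch.rand(2, 3, 1024))  # per-point class scores
```

The max-pooling over the point dimension is the symmetric function that makes the global feature invariant to the ordering of the input points, which is why the network can consume raw, unordered point clouds.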
Figure 4. Using environmental features to create a semantic map. SuMa++ [86] operating in an environment using LiDAR sensors, which provide rich information for understanding the surroundings of the vehicle.
Figure 5. Using a Dynamic Scene Graph (DSG) [85] to generate a multi-layer abstraction of an indoor environment.
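To make the layered abstraction in Figure 5 concrete, the toy sketch below represents a scene graph with plain Python data classes: nodes live in discrete layers (e.g., building, room, place, object, agent), each node points to its parent in the layer above, and intra-layer edges link neighbouring nodes. The layer names, fields and methods are illustrative assumptions and do not follow the actual Kimera/DSG API.

```python
# Toy data-structure sketch of a layered 3D scene graph in the spirit of DSG [85].
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SceneNode:
    node_id: str
    layer: str                     # e.g. "building", "room", "place", "object", "agent"
    centroid: tuple                # coarse 3D position of the node
    parent: Optional[str] = None   # node_id of the parent in the layer above
    neighbours: list = field(default_factory=list)  # intra-layer edges

@dataclass
class SceneGraphSketch:
    nodes: dict = field(default_factory=dict)

    def add_node(self, node: SceneNode) -> None:
        self.nodes[node.node_id] = node

    def children(self, parent_id: str) -> list:
        # All nodes one layer below that hang off the given parent.
        return [n for n in self.nodes.values() if n.parent == parent_id]

# Example: a room containing one static object and one moving agent.
g = SceneGraphSketch()
g.add_node(SceneNode("room_1", "room", (2.0, 3.0, 0.0), parent="building_1"))
g.add_node(SceneNode("chair_7", "object", (2.5, 3.1, 0.4), parent="room_1"))
g.add_node(SceneNode("agent_0", "agent", (1.8, 2.9, 0.0), parent="room_1"))
print([n.node_id for n in g.children("room_1")])  # ['chair_7', 'agent_0']
```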
Figure 6. The multi-modal Frustum PointNets model [100], which uses a CNN to project objects detected in RGB images into 3D space, thereby improving the accuracy of semantic understanding.
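The cropping step that gives the frustum its name can be sketched as follows: a 2D detection in the image, together with the camera intrinsics, selects the subset of 3D points whose projections fall inside the bounding box, and a point network such as the one sketched above can then segment and localise the object within that frustum. The function name, intrinsics matrix and bounding box below are hypothetical, and the sketch omits the subsequent instance segmentation and amodal box estimation stages of the actual pipeline.

```python
# Illustrative frustum-cropping step behind Frustum PointNets-style pipelines
# (a simplification for exposition, not the authors' code).
import numpy as np

def crop_frustum(points_cam: np.ndarray, K: np.ndarray, box_2d: tuple) -> np.ndarray:
    """points_cam: (N, 3) points in the camera frame; box_2d: (u_min, v_min, u_max, v_max)."""
    u_min, v_min, u_max, v_max = box_2d
    # Keep only points in front of the camera to avoid division by (near-)zero depth.
    pts = points_cam[points_cam[:, 2] > 1e-3]
    # Pinhole projection: [u, v, w]^T = K [x, y, z]^T, then normalise by w.
    uvw = (K @ pts.T).T
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    inside = (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
    return pts[inside]

# Hypothetical intrinsics and detection box, for illustration only.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
frustum_points = crop_frustum(np.random.rand(5000, 3) * 10 - 5, K, (200, 150, 400, 350))
```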