1. Introduction
With recent sensor and algorithm innovations [1], mobile robots are becoming smaller and smarter and are addressing new applications in medicine, agriculture, and security [2,3]. The improved sensors make them much more capable in perception, mapping, and navigation.
Unlike indoor navigation, which mostly deals with 2D, even and structured environments, rescue robots need more mobility to face varied, complicated scenarios such as slopes, stairs, and tunnels [4,5,6]. Search and rescue robots are usually equipped with LIDAR and high-quality IMU sensors, which are fused with the wheel odometry to estimate the poses and the map [7,8,9]. However, this leads to expensive robots, which is undesirable, because such mobile rescue robots have a high risk of breaking or being lost during operation. In the past years, we have seen great advances in the area of computer vision, also with respect to vSLAM (visual Simultaneous Localization and Mapping) [10]. Since then, many methods have been proposed to improve the accuracy, robustness, and efficiency of vSLAM, for example using first-estimates Jacobians [11] to reduce the inherent nonlinearity in the system. Optimization-based methods have also become very popular [12]. Additionally, monocular cameras were combined with other sensors such as depth cameras [13,14] and IMUs [15,16,17] to achieve more robust performance.
However, these approaches are mostly used in hand-held or UAV (Unmanned Aerial Vehicle) applications, where the motions follow smooth, curved trajectories and frequently change velocity. Thus, even under strong rotation, when there are few features to track in the image sequences, the IMU can still sense the motion well, which keeps the pose estimation error acceptable [15,17].
On the other hand, ground robots have some unique advantages in rescue scenarios [18], compared to aerial robots, for example the long operation time, the rich sensor payload [19] and the ability to traverse very confined areas. Hence, ground robots still play a critical role in search and rescue scenarios [6]. When rescue robots traverse difficult 3D terrain, the motions can be more complicated [4,20] than those of a UAV. There are very sudden accelerations when falling off a small edge or a stair step, or when bumping into a small obstacle. This results in strong ego-motions, which make it difficult for vSLAM and VIO (Visual Inertial Odometry) to track the features. Fusing the vision data with IMU readings is the obvious approach to solve this problem. However, when moving straight in a planar environment, the trajectories mostly have constant acceleration, and quite often constant speed, which leads to a close-to-zero acceleration. In this case the VINS (Vision-aided INertial System) model suffers from extra unobservable directions [21], which leads to inaccurate estimations.
This challenge becomes even tougher for tracked or small-sized robots. Besides the above unobservability problem, the IMU readings suffer from large amounts of noise. This is because small, tracked robots, with their relatively small mass, get easily excited by small obstacles such as stones or sand over which the robot drives, leading to big and frequent vibrations. An additional challenge rescue robots face is that there are often strong illumination changes and that environments can have few features, because everything is covered with dust [22]. So visual trackers for rescue robots face additional challenges.
In recent decades we have seen many innovative and improved sensor concepts, such as RGBD cameras, LIDAR and event cameras [23]. They perceive the environment using different approaches and integrate several different sensors on a single board, thanks to the advances in MEMS (MicroElectroMechanical System) technology. Kinect and Kinect V2 are two very popular RGBD cameras, using structured light and ToF (Time of Flight) technology, respectively. They are both quite large and do not work outdoors, because the sunlight outshines the infrared light emitted by the devices. LIDARs work outdoors and are quite accurate, but they deliver quite sparse data, are quite expensive and relatively fragile, due to the moving parts (mirror). In the last few years, Intel released their RealSense camera series [24], from R200 to ZR300 to D415/D435 [25] and other models. Some of them share the same market niche as the Kinect, while others target robotic systems which are compact and relatively cheap. A big advantage of some of those devices, such as the ZR300, is that they come with an integrated IMU. For robotics, a challenging problem is the limited measurement range of those sensors, with a large minimum range and a small maximum range. This significantly limits their application, and thus many open source VIO or SLAM systems [15,17] only make use of the integrated fisheye camera and IMU sensor. Fortunately, Intel released the D435i [26] after the D415/D435. It integrates the same IMU as the ZR300, but uses the vision sensors of the D435, such that we can take advantage of the D435's excellent range (0.2–10 m+), combined with time-synchronized accelerometer and gyroscope measurements. Another advantage of the Intel RealSense sensor series is that they also work outdoors in sunny conditions, due to their stereo-infrared camera approach [24,27,28], while also projecting an infrared pattern to be able to detect featureless surfaces.
The goal of our work is to enable a small rescue robot to do visual SLAM and thus ultimately navigation. We address the difficulties described above and propose to utilize the three different sensor readings of color, depth and IMU provided by the Intel RealSense D435i camera. For that, we base our solution, VINS-RGBD, on the open source VINS implementation [15], which uses RGB and IMU data. We extend the framework to also incorporate the depth readings and ultimately build an RGBD-inertial SLAM system for ground robots. The main contributions of our work are as follows:
Formulation and implementation of a depth-integrated initialization process for the VINS-RGBD system.
Formulation and implementation of a depth-integrated Visual Inertial Odometry (VIO), overcoming the degenerate cases of a vision-and-IMU-only VIO system.
Design and implementation of a backend mapping function to build dense point clouds with noise suppression, suitable for further map post-processing and path planning.
A color-depth-inertial dataset with handheld, wheeled robot, and tracked robot motion, with tracking system data for ground truth poses.
Thorough evaluation of the proposed VINS-RGBD system using the three datasets mentioned above.
We provide the dataset [29], which also includes a video of the experiments, and VINS-RGBD [30] as open source for the benefit of the robotics and visual SLAM communities.
The rest of the paper is structured as follows: Section 2 describes important related work and its applicability towards VIO with ground robots. Section 3 gives an overview of the system on which we base our work, VINS-Mono, and the observability problem it encounters. Section 4 introduces our VINS-RGBD approach, while Section 5 presents the corresponding experiments and results. In Section 6 we draw conclusions and discuss future work.
2. Related Work
Compared to LIDAR Simultaneous Localization and Mapping (SLAM), which has been under development for decades, visual SLAM (vSLAM) has recently gained more and more attention, especially after the influential work of Davison et al. [10], which showed that SLAM with a single camera can run in real time on a PC. The Extended Kalman Filter (EKF) or its variants are often used in LIDAR SLAM, and also in that first MonoSLAM [10]. Optimization-based methods are becoming more popular, since they can exploit the sparsity of the Hessian matrix in the optimization function and usually provide more accurate results. PTAM [31] was the first optimization-based method and also the first to perform tracking and mapping in parallel. Based on such a tracking front end and optimization back end, approaches such as ORB-SLAM [12] were developed, which additionally use bag-of-words [32] for loop detection to reduce the accumulated error. After projecting the environment onto image planes, the depth information is lost and only the perspective geometry relationship is preserved. Therefore, monocular SLAM is up-to-scale. However, combined with another camera at a fixed, known baseline, we can obtain the depth by calculating the disparity of sufficiently distinguishable feature points. Next to this stereo camera approach, structured light and Time of Flight (ToF) techniques, combined with a color camera, can also provide depth and color data, thus RGBD data. Visual SLAM with RGBD cameras [13,33] can be applied to robot autonomy and navigation. As both stereo and RGBD SLAM use a single camera as the perception source, the performance is limited by camera characteristics such as motion blur under fast rotation, distortion caused by the rolling shutter, and also the short measurement range of the RGBD camera.
For ground robots, the trajectory estimation, i.e., the localization of the robot, is quite often assumed to take place on flat ground. We can thus estimate the trajectory under a 2D plane constraint, which accelerates the estimation and also improves the robustness thanks to the motion model constraint. Scaramuzza [34] models the 2D motion as approximately planar circular motion, so a 1-point minimal relative pose solver can be applied. For a general planar motion model, an iterative 2-point algorithm [35] using the Newton method, a 3-point algorithm using a linear equation, and non-iterative 2-point algorithms [36] were proposed.
Unlike Unmanned Ground Vehicles (UGVs), Unmanned Aerial Vehicles (UAVs), especially quad-copters, have to estimate their motion in 3D with 6 degrees of freedom. To avoid obstacles, the unknown scale factor also has to be solved. So there is considerable work on vSLAM for UAVs where the cameras are combined with inertial measurement units (IMUs) to achieve robust and scale-deterministic pose estimation. Due to the complex motions of the UAV, the IMU is sufficiently excited and thus unobservable states are avoided. The visual and inertial fusion is also called VINS (Visual INertial System). It is usually divided into the categories of loosely and tightly coupled systems, where the former [37,38] treats the two sensors as individual pose estimation sources and fuses them by using an EKF. Tightly coupled visual inertial fusion includes both EKF based algorithms such as ROVIO [17] and MSCKF [39] and optimization based algorithms such as VINS-Mono [15] and OKVIS [40]. Tightly coupled methods jointly estimate all the IMU and camera states and usually obtain more accurate results. Most VINS are developed to be deployed on UAVs or in handheld applications. They only fuse a single camera with an IMU, as this already gives good results in their application. However, in optimization based algorithms it is easy to add other residual terms, hence making it possible to fuse the IMU with a stereo camera or even more sensors [41]. This can also solve the problem mentioned in [21], which is that the scale is unobservable under special motions. However, they only give results of the 2D performance, and also require a complex configuration of the stereo camera with the IMU sensor, especially w.r.t. time synchronization. Besides the unobservability problem, the initial values of a VINS are crucial for the subsequent VIO (Visual Inertial Odometry) process, and much work [42,43,44] has been put into achieving better performance, mostly for handheld or UAV applications.
Combining RGBD sensors with IMUs for visual SLAM is another way to solve the scale unobservability problem [21], which is also suitable for ground robot navigation. RGBD sensors calculate the depth independently and efficiently and output dense depth images at a high rate. An IMU-RGBD extrinsic calibration is proposed in [45], where the authors analyze the observability and design an OC (Observability Constrained) EKF to improve consistency. A loosely coupled method, which fuses inertial sensors and RGBD cameras for mobile devices, is introduced in [46], where an array of Kalman filters is used. Another loosely coupled method is proposed in [47], with application to direct methods of frame-to-frame motion estimation. There, the performance of the IMU aid is evaluated in both semi-dense and fully dense tracking processes. Ref. [48] is the first tightly coupled optimization-based method in RGBD-inertial fusion, which focuses on 3D reconstruction while running in real time on a GPU, while [49] is another extended Kalman filter based RGBD-inertial method for indoor scene reconstruction. By utilizing inertial fusion and kernel cross-correlators to match point clouds in the frequency domain, Wang et al. [50] propose a non-iterative odometry, where a dense indoor fused map is given. Ling et al. [51] simply combine Kinect depth images with ORB features to achieve indoor 2D robot pose estimation. However, that paper only reports limited qualitative results.
Three-dimensional reconstruction differs from mapping for robotics mainly in the fact that robots are primarily interested in whether a location is occupied or free, while 3D reconstruction cares more about the texture of the estimated shapes. In robotics, the original point cloud map is usually treated as the input to different kinds of post-processes that are employed to make use of the map. Elevation maps [52] are 2.5D representations which store one height value per cell in the x-y plane. Multi-level surface maps [53] are another, more powerful map representation that also supports loop closing. OctoMap [54] is a true 3D map representation based on octrees, which can be dynamically expanded with regard to global and local planning scenes. Yang et al. [55] propose an update strategy for OctoMap after loop closing, which builds a globally consistent map.
A benchmark comparison of VINS algorithms is discussed in [56]. Although it is aimed at flying robots, the results show that VINS-Mono [15] achieves the most accurate performance while having low computational resource requirements. Therefore, we build our RGBD-inertial system for 3D pose estimation of ground robots on the open source VINS implementation and call it VINS-RGBD. We also extend the VINS system with a mapping module that makes use of the dense depth information from the RGBD sensor. This 3D map enables further applications such as path planning and navigation. Hence, we propose to generate noise-eliminated point clouds in an octree.
3. VINS-Mono Analysis
VINS-Mono [15] is a tightly coupled, nonlinear optimization based method consisting of a feature tracking module, a visual-inertial odometry with the necessary estimator initialization process for the IMU parameters and the metric scale, and a sliding window optimization which takes the initialization values as starting values. Moreover, a pose graph module takes care of relocalization (i.e., loop detection) and the subsequent global optimization. It is a state-of-the-art visual-inertial system with good performance on UAVs and in hand-held applications. The system structure is shown in Figure 1. The red solid line blocks represent the modules we modified to incorporate depth information, and the red dashed line blocks are the modules we newly added. Black blocks are original modules of VINS-Mono. We first introduce the original blocks as the base of our work.
3.1. Vision Front End
VINS-Mono supports both rolling and global shutter cameras. It uses the KLT sparse optical flow algorithm [57] to track Shi–Tomasi features [58]. The features are kept uniformly distributed over the images to achieve robust tracking performance. Keyframes are selected by checking the average parallax between the current frame and the latest keyframe, along with the number of tracked features: if the parallax is beyond a certain threshold or the number of tracked features drops below a certain threshold, a new keyframe is inserted.
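As an illustration of this style of front end, the following is a minimal OpenCV sketch of Shi–Tomasi detection combined with pyramidal KLT tracking. The parameters and the webcam input are placeholders; the actual VINS-Mono tracker additionally rejects outliers via RANSAC and masks out regions around existing features.

```cpp
// Minimal sketch of a KLT front end: Shi-Tomasi corners + pyramidal
// Lucas-Kanade tracking. Illustrative only, not the VINS-Mono code.
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::VideoCapture cap(0);           // webcam stands in for the color stream
    cv::Mat frame, gray, prev_gray;
    std::vector<cv::Point2f> prev_pts, cur_pts;
    const int    kMaxFeatures = 150;   // assumed feature budget
    const double kMinDist     = 30.0;  // spacing for uniform distribution

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        if (!prev_pts.empty()) {
            std::vector<uchar> status;
            std::vector<float> err;
            cv::calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, cur_pts,
                                     status, err);
            std::vector<cv::Point2f> kept;      // keep only tracked features
            for (size_t i = 0; i < status.size(); ++i)
                if (status[i]) kept.push_back(cur_pts[i]);
            cur_pts.swap(kept);
        }
        // top up with fresh Shi-Tomasi corners (a mask around tracked points
        // would additionally enforce spacing against the existing set)
        if (static_cast<int>(cur_pts.size()) < kMaxFeatures) {
            std::vector<cv::Point2f> fresh;
            cv::goodFeaturesToTrack(
                gray, fresh, kMaxFeatures - static_cast<int>(cur_pts.size()),
                0.01, kMinDist);
            cur_pts.insert(cur_pts.end(), fresh.begin(), fresh.end());
        }
        prev_gray = gray.clone();
        prev_pts  = cur_pts;
    }
    return 0;
}
```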
3.2. IMU Pre-Integration
An IMU typically includes a three-axis accelerometer and a three-axis gyroscope, which measure the acceleration $\hat{a}_t$ and the angular velocity $\hat{\omega}_t$ with respect to an inertial frame. Affected by noise, the IMU measurements commonly contain additive white noise $n$ and a time-varying bias $b$:

$$ \hat{a}_t = a_t + b_{a_t} + R_w^t\, g^w + n_a, \qquad \hat{\omega}_t = \omega_t + b_{\omega_t} + n_\omega \tag{1} $$

As shown in Equation (1), accelerometers and gyroscopes suffer from noise that is usually modeled as Gaussian white noise,

$$ n_a \sim \mathcal{N}\!\left(0, \sigma_a^2\right), \qquad n_\omega \sim \mathcal{N}\!\left(0, \sigma_\omega^2\right) \tag{2} $$

The bias is usually modeled as a random walk process [15,17,59]:

$$ \dot{b}_{a_t} = n_{b_a} \sim \mathcal{N}\!\left(0, \sigma_{b_a}^2\right), \qquad \dot{b}_{\omega_t} = n_{b_\omega} \sim \mathcal{N}\!\left(0, \sigma_{b_\omega}^2\right) \tag{3} $$
The accelerometer measurements additionally contain the gravity vector, expressed in the world coordinate system, which needs to be compensated. The Earth's rotation is usually neglected, because it is very small compared to the noise mentioned above.
MEMS-IMUs typically have an accelerometer and gyroscope output rate of 200–400 Hz, while cameras usually have a frame rate of around 30 Hz. Since these rates differ by roughly an order of magnitude, i.e., there are about 10 IMU measurements between two camera frames, the IMU measurements are usually pre-integrated between two visual frames. The pre-integrated values are treated as a constraint between the two camera frames, which is used later during the local window optimization together with the camera reprojection residuals.
We use $b_k$ and $b_{k+1}$ to indicate two consecutive camera frames, where $k$ is the frame index. As mentioned above, there are a couple of IMU measurements between time $t_k$ (at frame $b_k$) and time $t_{k+1}$ (at frame $b_{k+1}$). The pre-integration is then defined as follows:

$$ \alpha_{b_{k+1}}^{b_k} = \iint_{t \in [t_k, t_{k+1}]} R_t^{b_k}\, a_t \; dt^2, \qquad \beta_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} R_t^{b_k}\, a_t \; dt, \qquad \gamma_{b_{k+1}}^{b_k} = \int_{t \in [t_k, t_{k+1}]} \tfrac{1}{2}\, \Omega(\omega_t)\, \gamma_t^{b_k} \; dt \tag{4} $$

The three items $\alpha_{b_{k+1}}^{b_k}$, $\beta_{b_{k+1}}^{b_k}$ and $\gamma_{b_{k+1}}^{b_k}$ in Equation (4) correspond to translation, velocity and rotation, respectively. $a_t$ and $\omega_t$ are the ideal values without noise and bias. With $\lfloor \omega \rfloor_\times$ denoting the skew-symmetric matrix of a vector $\omega$, $\Omega(\omega)$ is defined as follows:

$$ \Omega(\omega) = \begin{bmatrix} -\lfloor \omega \rfloor_\times & \omega \\ -\omega^T & 0 \end{bmatrix} \tag{5} $$
To save computational resources, the biases $b_{a_t}$ and $b_{\omega_t}$ are assumed to be constant during the pre-integration between two consecutive camera frames. The Jacobian matrices of each pre-integration term with respect to the biases are then calculated to correct for the effect of small bias changes. If the estimated bias changes significantly, the pre-integration is re-propagated under the new bias value.
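To make the discrete computation concrete, here is a minimal sketch of propagating the three pre-integration terms of Equation (4) over the IMU samples between two camera frames, assuming constant biases and simple Euler integration. All names are ours; the real implementation additionally uses midpoint integration and propagates the covariance and the bias Jacobians mentioned above.

```cpp
// Hedged sketch: discrete pre-integration of Equation (4) between two
// camera frames, under a constant-bias assumption.
#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <vector>

struct ImuSample { double dt; Eigen::Vector3d acc, gyr; };

struct PreIntegration {
    Eigen::Vector3d    alpha = Eigen::Vector3d::Zero();         // translation
    Eigen::Vector3d    beta  = Eigen::Vector3d::Zero();         // velocity
    Eigen::Quaterniond gamma = Eigen::Quaterniond::Identity();  // rotation
};

PreIntegration preintegrate(const std::vector<ImuSample>& samples,
                            const Eigen::Vector3d& ba,   // accel bias (const)
                            const Eigen::Vector3d& bw) { // gyro bias (const)
    PreIntegration p;
    for (const ImuSample& s : samples) {
        Eigen::Vector3d a = s.acc - ba;  // bias-corrected acceleration
        Eigen::Vector3d w = s.gyr - bw;  // bias-corrected angular velocity
        // rotate the acceleration into the b_k frame via the current gamma
        Eigen::Vector3d a_bk = p.gamma * a;
        p.alpha += p.beta * s.dt + 0.5 * a_bk * s.dt * s.dt;
        p.beta  += a_bk * s.dt;
        // quaternion increment for the small rotation w * dt
        Eigen::Vector3d half = 0.5 * w * s.dt;
        p.gamma = (p.gamma *
                   Eigen::Quaterniond(1.0, half.x(), half.y(), half.z()))
                      .normalized();
    }
    return p;
}
```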
3.3. Visual-Inertial Initialization and VIO
VINS-Mono tightly couples the camera and IMU states in the estimation. Also, as we have seen above, the IMU has a complex parameterization of noise and bias. Thus, an initialization process is necessary to obtain initial values for parameters such as the gyroscope bias, the gravity vector direction, and the metric scale for the up-to-scale monocular visual SfM part. Once the initialization is done, the system can run the sliding window based local VIO process.
3.3.1. Visual-Inertial Initialization
The SfM problem can be solved up-to-scale using only visual information. If the extrinsic transform $(p_c^b, q_c^b)$, where $p_c^b$ indicates the translation (position) and $q_c^b$ the rotation (quaternion) of the body/IMU frame with respect to the camera frame, is provided offline [60,61], then the IMU pose is also determined up-to-scale by a scalar $s$ and the camera pose:

$$ q_{b_k}^w = q_{c_k}^w \otimes \left(q_c^b\right)^{-1}, \qquad s\, p_{b_k}^w = s\, p_{c_k}^w - R_{b_k}^w\, p_c^b \tag{6} $$

Here, $w$ is set to equal the first camera frame, and $\otimes$ represents quaternion multiplication.
Combining these states with the IMU pre-integration term $\gamma_{b_{k+1}}^{b_k}$, we can calibrate the gyroscope bias. We have all IMU orientations $q_{b_k}^w$ and the pre-integration terms $\hat{\gamma}_{b_{k+1}}^{b_k}$ in the window. Using $\mathcal{B}$ to represent all frame indexes in the window, the gyroscope bias is obtained by minimizing a least-squares function:

$$ \min_{\delta b_\omega} \sum_{k \in \mathcal{B}} \left\| \left(q_{b_{k+1}}^w\right)^{-1} \otimes q_{b_k}^w \otimes \gamma_{b_{k+1}}^{b_k} \right\|^2, \qquad \gamma_{b_{k+1}}^{b_k} \approx \hat{\gamma}_{b_{k+1}}^{b_k} \otimes \begin{bmatrix} 1 \\ \frac{1}{2} J_{b_\omega}^{\gamma}\, \delta b_\omega \end{bmatrix} \tag{7} $$

Here, $J_{b_\omega}^{\gamma}$ represents the Jacobian matrix of the pre-integration term $\gamma$ with respect to $b_\omega$.
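A minimal sketch of how Equation (7) can be solved numerically, assuming the Jacobians $J_{b_\omega}^{\gamma}$ are available from the pre-integration: accumulate the normal equations over the window and solve for the bias update. Function and variable names are illustrative, not from the original code.

```cpp
// Hedged sketch of the gyroscope bias calibration of Equation (7).
// q_vis: relative rotations from visual SfM; gamma_hat: pre-integrated
// rotations; J_gamma_bw: Jacobians of gamma w.r.t. the gyro bias.
#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <vector>

Eigen::Vector3d solveGyroBias(const std::vector<Eigen::Quaterniond>& q_vis,
                              const std::vector<Eigen::Quaterniond>& gamma_hat,
                              const std::vector<Eigen::Matrix3d>& J_gamma_bw) {
    Eigen::Matrix3d A = Eigen::Matrix3d::Zero();
    Eigen::Vector3d b = Eigen::Vector3d::Zero();
    for (size_t k = 0; k < q_vis.size(); ++k) {
        // residual rotation between pre-integration and visual estimate
        Eigen::Quaterniond r = gamma_hat[k].inverse() * q_vis[k];
        // small-angle: vec(r) ~ 0.5 * J * delta_bw  ->  J * delta_bw = 2 vec(r)
        A += J_gamma_bw[k].transpose() * J_gamma_bw[k];
        b += J_gamma_bw[k].transpose() * (2.0 * r.vec());
    }
    return A.ldlt().solve(b);  // delta_bw, added to the current bias estimate
}
```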
Combining the states with the IMU pre-integration terms $\hat{\alpha}_{b_{k+1}}^{b_k}$ and $\hat{\beta}_{b_{k+1}}^{b_k}$, the velocity, gravity vector, and scale can also be calculated by formulating a linear measurement model and solving the resulting least-squares problem. We present this part in Section 4.2.
3.3.2. Visual-Inertial Odometry
Once the initialization is done, the sliding window based local VIO process is active to continuously estimate the state. With the extrinsic transform between the camera and IMU calibrated offline, the state variables in the sliding window are defined as follows:

$$ \mathcal{X} = \left[ x_0,\, x_1,\, \ldots,\, x_n,\, \lambda_0,\, \lambda_1,\, \ldots,\, \lambda_m \right], \qquad x_k = \left[ p_{b_k}^w,\, v_{b_k}^w,\, q_{b_k}^w,\, b_a,\, b_\omega \right],\quad k \in [0, n] \tag{8} $$

Here, $n$ is the number of frames in the sliding window and $x_k$ is the IMU state synchronized with the $k$th camera frame, which contains the position, velocity and orientation with respect to the world frame, as well as the accelerometer and gyroscope biases. $\lambda_l$, $l \in [0, m]$, represents the inverse depth from triangulation of each landmark [62] in the frame of its first observation.

The IMU residual $r_B$ between two consecutive frame observations $\hat{z}_{b_{k+1}}^{b_k}$ and states is then defined (denoted by the symbol $\doteq$) as follows:

$$ r_B\!\left(\hat{z}_{b_{k+1}}^{b_k}, \mathcal{X}\right) \doteq \begin{bmatrix} \delta\alpha_{b_{k+1}}^{b_k} \\ \delta\beta_{b_{k+1}}^{b_k} \\ \delta\theta_{b_{k+1}}^{b_k} \\ \delta b_a \\ \delta b_\omega \end{bmatrix} = \begin{bmatrix} R_w^{b_k}\!\left( p_{b_{k+1}}^w - p_{b_k}^w + \frac{1}{2} g^w \Delta t_k^2 - v_{b_k}^w \Delta t_k \right) - \hat{\alpha}_{b_{k+1}}^{b_k} \\ R_w^{b_k}\!\left( v_{b_{k+1}}^w + g^w \Delta t_k - v_{b_k}^w \right) - \hat{\beta}_{b_{k+1}}^{b_k} \\ 2 \left[ \left(\hat{\gamma}_{b_{k+1}}^{b_k}\right)^{-1} \otimes \left(q_{b_k}^w\right)^{-1} \otimes q_{b_{k+1}}^w \right]_{xyz} \\ b_{a_{k+1}} - b_{a_k} \\ b_{\omega_{k+1}} - b_{\omega_k} \end{bmatrix} \tag{9} $$

Here, $\delta\alpha_{b_{k+1}}^{b_k}$ and $\delta\beta_{b_{k+1}}^{b_k}$ correspond to the translation and velocity IMU pre-integration residual terms, respectively, with $\Delta t_k$ the duration between the two frames. $\delta\theta_{b_{k+1}}^{b_k}$ is in axis-angle representation, corresponds to the rotation pre-integration residual term and is also the rotation residual. $[\cdot]_{xyz}$ is the function that converts a quaternion to its axis-angle vector representation.
The camera visual residual $r_C$ is defined as the sum of the reprojection errors of every landmark visible in the window. Knowing the extrinsic calibration $(p_c^b, q_c^b)$, the camera poses can be transformed to the IMU poses. Combined with all landmark depths in the state vector, it is easy to formulate the reprojection residual. During the sliding window based VIO process, new states are continuously added while old states are eliminated. Directly discarding the old states would cause a loss of accuracy, thus a marginalization strategy, implemented using the Schur complement [63], is used. The VIO optimization function containing the residuals of the marginalization prior, the IMU and the camera is written as:

$$ \min_{\mathcal{X}} \left\{ \left\| r_p - H_p\, \mathcal{X} \right\|^2 + \sum_{k \in \mathcal{B}} \left\| r_B\!\left(\hat{z}_{b_{k+1}}^{b_k}, \mathcal{X}\right) \right\|_{P_{b_{k+1}}^{b_k}}^2 + \sum_{(l,k) \in \mathcal{C}} \rho\!\left( \left\| r_C\!\left(\hat{z}_l^{c_k}, \mathcal{X}\right) \right\|_{P_l^{c_k}}^2 \right) \right\} \tag{10} $$

Here, $\{r_p, H_p\}$ is the prior information constructed from the marginalization. The pair $(l, k)$ indicates the $l$th feature observed in the $k$th camera frame in the window. $\rho$ is the Huber norm loss function, used to reduce the influence of outliers.
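To illustrate how such a sliding-window problem is assembled with a nonlinear least-squares library such as Ceres (which VINS-Mono uses, see Section 4.3), the following hedged sketch adds the IMU and reprojection residuals of Equation (10). The `ImuFactor`/`ProjectionFactor` cost functions are placeholders assumed to be implemented elsewhere, and the marginalization prior is omitted; this shows the structure of the problem, not the actual VINS-Mono code.

```cpp
// Hedged sketch: assembling the residuals of Equation (10) with Ceres.
#include <ceres/ceres.h>
#include <array>
#include <utility>
#include <vector>

void buildProblem(ceres::Problem& problem,
                  std::vector<std::array<double, 7>>& poses,      // p, q per frame
                  std::vector<std::array<double, 9>>& speed_bias, // v, b_a, b_w
                  std::vector<double>& inv_depth,                 // lambda_l
                  const std::vector<ceres::CostFunction*>& imu_factors,
                  const std::vector<ceres::CostFunction*>& proj_factors,
                  const std::vector<std::pair<int, int>>& proj_frames,
                  const std::vector<int>& proj_landmark) {
    // rho in Equation (10): Huber loss to reduce outlier influence
    ceres::LossFunction* huber = new ceres::HuberLoss(1.0);
    const int n = static_cast<int>(poses.size());
    // IMU residuals r_B link consecutive states x_k and x_{k+1}
    for (int k = 0; k + 1 < n; ++k)
        problem.AddResidualBlock(imu_factors[k], nullptr,
                                 poses[k].data(), speed_bias[k].data(),
                                 poses[k + 1].data(), speed_bias[k + 1].data());
    // reprojection residuals r_C link two observing frames and one inverse depth
    for (size_t i = 0; i < proj_factors.size(); ++i)
        problem.AddResidualBlock(proj_factors[i], huber,
                                 poses[proj_frames[i].first].data(),
                                 poses[proj_frames[i].second].data(),
                                 &inv_depth[proj_landmark[i]]);
}
```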
3.4. Loop Detection and Pose Graph Optimization
As the sliding window based VIO is running, error unavoidably accumulates, due to all kinds of noise, such as IMU and camera measurement noise, numerical accuracy errors, the residual error at the end of the optimization, as well as errors induced by the marginalization process. Like most SLAM systems, VINS-Mono also incorporates a loop closure module, which uses a Bag of Words approach (DBoW2) [32] for loop detection. Once a loop is detected, the loop constraint is added together with the other relative transformation constraints between sequential keyframes. Then, the pose graph optimization process is activated to distribute the error over all poses. In normal applications such as handheld and UAV use, the scale is observable thanks to the accelerometer measurements. The roll and pitch angles are also observable from the IMU. The global pose graph optimization, hence, only needs to be performed in the 4 DOF of x, y, z and yaw.
3.5. Observability Analysis of VINS-Mono
For dynamic systems, observability is a fundamental property, and the nullspace of the observability matrix corresponds to the system's unobservable directions. More specifically, the rank of the nullspace is equal to the number of unobservable directions, while each unobservable direction has a corresponding physical meaning [64]. Besides, observability is an inherent system property and hence independent of the implementation. Since VINS-Mono still uses the normal VINS system modeling equations [64], it suffers from four unobservable directions, which are the 3-DOF global translation and the 1-DOF rotation around the gravity vector, i.e., the yaw angle. It degenerates under special motions, which are frequently encountered with ground robots. For instance, when robots are moving at constant local acceleration,

$$ {}^{B}a(t) \equiv {}^{B}a \tag{11} $$

such as a uniform linear motion or turning at a constant velocity, the system has another unobservable direction, which corresponds to scale. Note that the acceleration of the IMU is inherently measured in the body frame; however, to avoid ambiguity, we explicitly label the frame in Equation (11). To simplify the problem, we consider one state and change the corresponding visible landmark expression in Equation (8) to contain $p_f^w$, denoted as $f$:

$$ x = \left[ p_b^w,\; v_b^w,\; q_b^w,\; b_a,\; b_\omega,\; p_f^w \right] \tag{12} $$

Thus, the corresponding up-to-scale state vector $x'$ can be written as:

$$ x' = \left[ s\, p_b^w,\; s\, v_b^w,\; q_b^w,\; b_a - (s-1)\, {}^{B}a,\; b_\omega,\; s\, p_f^w \right] \tag{13} $$

where we use the superscript $'$ to represent up-to-scale. The error state [64] of $x'$ is therefore:

$$ \tilde{x}' = \tilde{x} + \left[ (s-1)\, p_b^w,\; (s-1)\, v_b^w,\; 0,\; -(s-1)\, {}^{B}a,\; 0,\; (s-1)\, p_f^w \right] \tag{14} $$

After rewriting the equation, we get:

$$ \tilde{x}' = \tilde{x} + (s-1)\, N_s, \qquad N_s = \left[\, p_b^{w\,T},\; v_b^{w\,T},\; 0^T,\; -{}^{B}a^T,\; 0^T,\; p_f^{w\,T} \,\right]^T \tag{15} $$

It is shown in [65] that $N_s$ lies in the nullspace of the observability matrix, under the condition that Equation (11) is satisfied. It is clear that the error state can be changed by an arbitrary $s$ while keeping the same system measurements. Hence, the system suffers from scale unobservability under this special motion. The physical interpretation is also presented in [65]: the accelerometer bias cannot be distinguished from the true acceleration, and since monocular VINS relies on the acceleration to obtain scale information, the scale is unobservable under this condition.
To intuitively show the constant local acceleration motion, we plot the measured acceleration values for three different scenarios in Figure 2. The columns from left to right represent the handheld case, the wheeled robot under circular motion, and the tracked robot under uniform linear motion, and the rows from top to bottom represent the x, y and z axes, respectively. The raw acceleration values are plotted in red. For visualization only, Figure 2 also shows the result of a lowpass filter in blue for the wheeled and tracked robot data, which contain lots of noise. It is clear that in the handheld application the IMU is sufficiently excited, especially in the x and y directions, while the wheeled and tracked robots under circular and uniform linear motion have almost constant acceleration. Moreover, they suffer considerably from noise, especially the tracked robot. Although IMU pre-integration can average out most of the noise if it is zero-mean, the monocular system still becomes unstable when the noise is very large, which is the usual condition for tracked robots.
On the other hand, when robots perform motions without rotation, ${}^{B}\omega(t) \equiv 0$, they suffer from further unobservable directions, which make all three global orientations unobservable. Wu et al. [21] developed mVINS (VINS within a Manifold), which incorporates VINS with manifold motion constraints to solve this problem. Unfortunately, for rescue robots this assumption is invalid due to the complexity of the terrain. Thanks to the loop closure module and our map filter, even though an unobservable global orientation remains, we can still estimate a good trajectory and build good maps.
4. VINS-RGBD
To address the scale unobservability problem described above and to improve the overall accuracy and robustness, we propose to incorporate the depth measurements of the RGBD sensor into the VINS-Mono system. We use the RealSense D435i RGBD sensor, which has an integrated IMU. Our algorithm contains a feature tracker node which tracks features in the RGB image and obtains the distance information from the depth image. By utilizing depth images, the system initialization gains another source of scale information and can thus be more robust and accurate. The VIO process also works with the depth source to address the scale unobservability problem under constant local acceleration motion, while also increasing the accuracy for normal motions. Last but not least, we develop a mapping function which builds dense point clouds and eliminates noise at the same time.
4.1. Depth Estimation
The feature extraction and tracking method as well as the keyframe decision criteria are the same as presented in Section 3.1. The tracked 2D features in the color images are combined with the pixel values at the same location in the depth images, which are aligned to the color frame. Then, the 3D points together with their coordinates in the image frame are passed to the following initialization or VIO process. Since the D435i sensor uses stereo infrared cameras with an additional structured light projector, and due to the intrinsic characteristics of the depth sensor, holes and some structured-light patterns (zero measurements) remain, even though the depth images have been processed by the hole-filling filter RealSense provides. If a 3D feature is observed multiple times in different frames, we can reject zero and otherwise incorrect depth measurements with our depth verification algorithm. Points which cannot obtain a sufficiently accurate depth are passed on to the original VINS-Mono triangulation process, where their depth can be estimated, as long as they are observed in two different color frames.
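To illustrate this step, here is a hedged sketch of attaching a metric depth to a tracked feature from the aligned depth image and back-projecting it to a 3D point. The 16-bit millimeter encoding and the trusted range limits are assumptions based on typical RealSense defaults, not values from this paper.

```cpp
// Hedged sketch: look up the aligned depth at a feature pixel and
// back-project with the pinhole model. Intrinsics fx, fy, cx, cy are
// assumed to come from the color stream's calibration.
#include <opencv2/opencv.hpp>
#include <optional>

struct Point3 { double x, y, z; };

std::optional<Point3> featureTo3D(const cv::Mat& depth_aligned, // CV_16UC1, mm
                                  cv::Point2f uv,
                                  double fx, double fy, double cx, double cy) {
    uint16_t raw = depth_aligned.at<uint16_t>(cvRound(uv.y), cvRound(uv.x));
    if (raw == 0) return std::nullopt;           // hole / invalid measurement
    double z = raw * 1e-3;                       // millimeters to meters
    if (z < 0.2 || z > 6.0) return std::nullopt; // assumed trusted range
    return Point3{(uv.x - cx) / fx * z, (uv.y - cy) / fy * z, z};
}
```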
4.2. System Initialization
After the tracking module, 3D points are extracted from each image with respect to the image frame which first observed them. VINS-Mono uses a window based optimization, so we first need to set the values of the state vector, i.e., Equation (8). In VINS-Mono, since only the color source is available, the initial poses of the camera, and hence the IMU, in the window are solved using the five-point algorithm [66]. This leads to the up-to-scale problem. Although the up-to-scale poses are then aligned to the IMU to achieve absolute scale, the IMU always suffers from noise, especially in ground robot applications and on low-cost IMUs, so the alignment results are not always good and stable. With the depth source available in our VINS-RGBD approach, it becomes feasible to perform the absolute-scale pose initialization of the cameras using PnP (Perspective-n-Point) algorithms [67,68,69]. The PnP problem estimates the camera poses by using 2D features in the images and their corresponding 3D points in space. The difference between the five-point algorithm and PnP algorithms is illustrated in Figure 3.
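As an illustration of such a PnP-based initialization, below is a sketch using OpenCV's RANSAC PnP solver. The specific solver, thresholds and the undistortion assumption are ours; the paper only specifies that PnP algorithms [67,68,69] are used.

```cpp
// Hedged sketch: absolute-scale pose initialization with RANSAC PnP,
// instead of the up-to-scale five-point algorithm. pts3d are features
// with verified depth in the reference frame; pts2d their observations
// in the current frame.
#include <opencv2/opencv.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

bool initialPose(const std::vector<cv::Point3f>& pts3d,
                 const std::vector<cv::Point2f>& pts2d,
                 const cv::Mat& K,              // 3x3 camera matrix
                 cv::Mat& rvec, cv::Mat& tvec) {
    if (pts3d.size() < 6) return false;         // need enough correspondences
    std::vector<int> inliers;
    // distortion assumed already removed from pts2d (undistorted features)
    return cv::solvePnPRansac(pts3d, pts2d, K, cv::noArray(), rvec, tvec,
                              false, 100, 2.0, 0.99, inliers);
}
```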
Equation (6) is then rewritten without the scale $s$:

$$ q_{b_k}^w = q_{c_k}^w \otimes \left(q_c^b\right)^{-1}, \qquad p_{b_k}^w = p_{c_k}^w - R_{b_k}^w\, p_c^b \tag{16} $$

As shown in Equation (7), the gyroscope bias is calibrated initially using the IMU pre-integration term $\gamma_{b_{k+1}}^{b_k}$. Since the scale is known, the subsequent parameter initialization only contains the velocities and the gravity vector, which can be calculated by minimizing the terms $\delta\alpha_{b_{k+1}}^{b_k}$ and $\delta\beta_{b_{k+1}}^{b_k}$ of Equation (9) in the window with respect to the variables $v_{b_k}^w$ and $g^w$. Thus, the state vector can be defined as:

$$ \mathcal{X}_I = \left[ v_{b_0}^w,\; v_{b_1}^w,\; \ldots,\; v_{b_n}^w,\; g^w \right] \tag{17} $$

The linear measurement model is:

$$ \hat{z}_{b_{k+1}}^{b_k} = \begin{bmatrix} \hat{\alpha}_{b_{k+1}}^{b_k} - R_w^{b_k} \left( p_{b_{k+1}}^w - p_{b_k}^w \right) \\ \hat{\beta}_{b_{k+1}}^{b_k} \end{bmatrix} = H_{b_{k+1}}^{b_k}\, \mathcal{X}_I + n_{b_{k+1}}^{b_k} \tag{18} $$

The linear least-squares problem is then defined by adding all the residual terms:

$$ \min_{\mathcal{X}_I} \sum_{k \in \mathcal{B}} \left\| \hat{z}_{b_{k+1}}^{b_k} - H_{b_{k+1}}^{b_k}\, \mathcal{X}_I \right\|^2 \tag{19} $$
After solving the least-squares problem, the same gravity refinement approach as in the original VINS-Mono is applied, since the magnitude of the gravity is always known. By taking advantage of this information, the gravity vector can be modelled as a unit vector times the known magnitude plus a two-DOF perturbation in its tangent space. Note that the accelerometer bias is not included in the initialization process, since the gravity vector is a few orders of magnitude larger than the bias, which makes the bias hard to recover. This affects the scale recovery to a certain degree in VINS-Mono; however, it has less influence on our VINS-RGBD.
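A sketch of how such a stacked linear system can be solved, under the assumption that the per-frame blocks $H$ and measurements $\hat{z}$ of Equation (18) have been assembled; the final re-normalization is a crude stand-in for the tangent-space gravity refinement described above.

```cpp
// Hedged sketch: solve the linear alignment of Equation (19) via the
// normal equations, then re-normalize gravity to its known magnitude.
#include <Eigen/Dense>
#include <vector>

Eigen::VectorXd solveAlignment(const std::vector<Eigen::MatrixXd>& H,  // 6 x m
                               const std::vector<Eigen::VectorXd>& z,  // 6
                               int m) {            // m = 3*(n+1) + 3 unknowns
    Eigen::MatrixXd A = Eigen::MatrixXd::Zero(m, m);
    Eigen::VectorXd b = Eigen::VectorXd::Zero(m);
    for (size_t k = 0; k < H.size(); ++k) {
        A += H[k].transpose() * H[k];
        b += H[k].transpose() * z[k];
    }
    Eigen::VectorXd x = A.ldlt().solve(b);
    // the last three entries hold g^w; enforce |g| = 9.81 (the proper
    // refinement re-solves with a 2-DOF tangent-space parameterization)
    x.tail<3>() = x.tail<3>().normalized() * 9.81;
    return x;
}
```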
4.3. Depth-Integrated VIO
The depth-integrated VIO process takes over the image streams once the initialization is done. The process starts from the last state $\mathcal{X}$ of Equation (8). Then, the window slides once to marginalize the oldest or the latest frame based on a parallax calculation, by which keyframes are also decided. The next new state $x_{n+1}$ is propagated by IMU integration and then added to the window. Based on all states in the window, in VINS-Mono, points which do not have depth are triangulated to obtain depth. In our implementation, the depth is attached to each observation frame of a feature. Thus, a consistent depth verification (Algorithm 1) can be applied to filter noisy depth measurements. For points which are beyond the depth sensor's range, the original triangulation method is used. Note that the sensor range does not need to be equal to the device's real range; it can be manually set by the user. This way the user can compensate for the increased error of faraway points in RGBD sensors.
Once we have all features in the window with their verified depth, the camera visual residual, i.e., the reprojection error, can be calculated. Together with the IMU residual and the marginalization prior of Equation (10), the Ceres non-linear optimization solver [70] is used to solve the problem. However, for low-cost IMUs such as the one the RealSense D435i uses, noise and bias are in bad condition, which violates the modelling of Equation (1). Hence, we set the features whose depth comes from the depth verification algorithm as constant, while leaving the features whose depth comes from triangulation as optimization variables, as Figure 4 shows.
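In Ceres terms, this policy can be expressed by marking the corresponding parameter blocks constant, as in the following sketch (assuming the inverse-depth parameters have already been added to the problem; names are illustrative):

```cpp
// Hedged sketch of the constant-depth policy of Figure 4: inverse depths
// confirmed by depth verification are frozen; triangulated ones stay free.
#include <ceres/ceres.h>
#include <vector>

void freezeVerifiedDepths(ceres::Problem& problem,
                          std::vector<double>& inv_depth,
                          const std::vector<bool>& from_depth_sensor) {
    // each &inv_depth[l] must already be a parameter block of the problem
    for (size_t l = 0; l < inv_depth.size(); ++l)
        if (from_depth_sensor[l])
            problem.SetParameterBlockConstant(&inv_depth[l]);
}
```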
Algorithm 1: Depth verification algorithm.
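The listing of Algorithm 1 is not reproduced here; the following is a hedged sketch of a depth verification in the spirit the text describes: zero measurements are rejected and a feature's depth is accepted only if repeated observations agree. The thresholds are illustrative assumptions, and the actual algorithm compares observations after accounting for the relative camera poses between them.

```cpp
// Hedged sketch of a consistency-based depth verification for one feature
// observed in several frames. Not the paper's actual Algorithm 1.
#include <algorithm>
#include <cmath>
#include <optional>
#include <vector>

std::optional<double> verifyDepth(std::vector<double> depths_m) {
    // drop holes / zero measurements
    depths_m.erase(std::remove_if(depths_m.begin(), depths_m.end(),
                                  [](double d) { return d <= 0.0; }),
                   depths_m.end());
    if (depths_m.size() < 2) return std::nullopt; // need repeated observations
    std::sort(depths_m.begin(), depths_m.end());
    double median = depths_m[depths_m.size() / 2];
    // all surviving measurements must lie within 5% of the median (assumed)
    for (double d : depths_m)
        if (std::abs(d - median) > 0.05 * median) return std::nullopt;
    return median;  // verified depth, held constant in the optimization
}
```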
4.4. Loop Closing
After detecting a loop using the Bag of Words approach, VINS-Mono estimates the transform between the two frames using a PnP approach on the 3D points estimated by triangulation. Thus, our VINS-RGBD automatically also improves the loop closing transform estimation, since the depths used are now provided by the more reliable RGBD sensor.
4.5. Mapping and Noise Elimination
Different from KinectFusion [14], which uses GPU-accelerated ICP [71] to find matching points between two depth images, we simply project the 3D points of each depth image, synchronized with the keyframes, using the corresponding poses estimated by VINS, to build the map; this keeps the system compact and suitable for integration on small robots. This straightforward method, however, does not match corresponding points among frames, resulting in many redundant points.
In order to better manage the point clouds, an octree [72] structure is maintained. An octree is a tree data structure used to describe three-dimensional space, in which each node represents the volume element of a cube. A parent cube can be subdivided into eight equal child cubes, such that the number of children of any node is either eight or zero. The volume elements represented by the eight child nodes sum up exactly to the volume of the parent node. The resolution defines the size of the smallest voxel. The basic structure of an octree is shown in Figure 5.
To achieve real-time low-cost mapping and navigation, we first downsample every depth image with a step size of 10 to get around 3000 points, while the original depth image contains more than 300,000 points at a resolution of 640 × 480. Note that these points are inherently different from the feature points that the tracking thread provides. They provide dense information, which is useful in navigation tasks, especially in the ground robot application, since ground robots not only avoid obstacles but also interact with them, for example when climbing stairs.
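A sketch of the stride-based downsampling and back-projection described above; the intrinsics and the millimeter depth encoding are assumptions, as before.

```cpp
// Hedged sketch: back-project roughly every hundredth pixel of a 640x480
// depth image (stride 10 in both directions, ~3000 points).
#include <opencv2/opencv.hpp>
#include <vector>

struct MapPoint { float x, y, z; };

std::vector<MapPoint> depthToCloud(const cv::Mat& depth, // CV_16UC1, mm
                                   double fx, double fy, double cx, double cy,
                                   int step = 10) {
    std::vector<MapPoint> cloud;
    cloud.reserve((depth.rows / step) * (depth.cols / step));
    for (int v = 0; v < depth.rows; v += step)
        for (int u = 0; u < depth.cols; u += step) {
            uint16_t raw = depth.at<uint16_t>(v, u);
            if (raw == 0) continue;              // skip holes
            float z = raw * 1e-3f;
            cloud.push_back({static_cast<float>((u - cx) / fx * z),
                             static_cast<float>((v - cy) / fy * z), z});
        }
    return cloud;  // camera frame; transform with the keyframe pose afterwards
}
```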
After downsampling, we set the upper bound for each leaf node in the octree to contain only five map points. Any extra points arriving later are rejected and also removed from the point vector attached to each keyframe, as Figure 6a shows. This may lose information; however, the resolution of the octree can always be set high enough to satisfy the demands of the application. Also note that we are not aiming at 3D reconstruction, but at building a map applicable to ground robots, especially for rescue robot navigation.
On the other hand, when a loop is detected and closed, we need to update all points after the loop closure, where the time consumption is linearly related to the number of points. Setting an upper bound on the capacity of each leaf node significantly reduces the number of newly added points, ensuring real-time mapping performance. However, the loop closure algorithm minimizes the error and may thus change all poses in the loop circle. There might still be a few poses with significant error, and the 3D points of those frames ideally should not be projected into the map. To achieve this, we set a threshold on the number of points in a leaf: points in leaves containing fewer points than the threshold are discarded. The threshold depends on the velocity, the frame rate and the keyframe selection strategy. For instance, if the robot moves fast, most octree leaf nodes may contain fewer points than the threshold and would be discarded incorrectly. Besides, the depth images provided by the depth camera contain noise which can be considered random; our elimination strategy also discards these points, leading to a more consistent map. The elimination process is shown in Figure 6b.
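The following sketch condenses the two rules above (a leaf capacity of five points, and leaves with too few points discarded as noise) on a flat voxel grid instead of a real octree [72]. The minimum-point threshold value is an assumption, since the paper states it depends on velocity, frame rate and keyframe selection.

```cpp
// Hedged sketch: capacity-bounded voxel map with noise elimination on export.
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Pt { float x, y, z; };

class VoxelMap {
public:
    explicit VoxelMap(float res) : res_(res) {}
    bool insert(const Pt& p) {                   // false if the leaf is full
        auto& leaf = grid_[key(p)];
        if (leaf.size() >= kCapacity) return false;
        leaf.push_back(p);
        return true;
    }
    std::vector<Pt> exportMap() const {          // noise elimination on export
        std::vector<Pt> out;
        for (const auto& kv : grid_)
            if (kv.second.size() >= kMinPoints)  // sparse leaves are dropped
                out.insert(out.end(), kv.second.begin(), kv.second.end());
        return out;
    }
private:
    static constexpr size_t kCapacity  = 5;      // upper bound from the text
    static constexpr size_t kMinPoints = 3;      // assumed noise threshold
    int64_t key(const Pt& p) const {             // hash of the voxel index
        int64_t ix = static_cast<int64_t>(std::floor(p.x / res_));
        int64_t iy = static_cast<int64_t>(std::floor(p.y / res_));
        int64_t iz = static_cast<int64_t>(std::floor(p.z / res_));
        return (ix * 73856093) ^ (iy * 19349663) ^ (iz * 83492791);
    }
    float res_;
    std::unordered_map<int64_t, std::vector<Pt>> grid_;
};
```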
6. Conclusions
We proposed a system to fuse color, depth and inertial sensors for trajectory estimation and mapping for small ground rescue robots. Our system, called VINS-RGBD, is based on the state-of-the-art software VINS-Mono. We take advantage of the RealSense D435i, which is small, lightweight and has an integrated IMU. We extend the VINS-Mono system to make use of the depth data during the initialization process as well as during the VIO phase. Exhaustive experiments and comparisons with handheld, wheeled robot and tracked robot motions are performed to evaluate the performance and versatility of our system. Our software is available as open source and the dataset is also provided. The experiments show the excellent performance of our system compared to the ground truth trajectories, to the VINS-Mono system and also to an ORB-based approach. We show that for ground robots, especially for tracked robots, the noise of the IMU readings is very high due to the vibrations induced by the ground contact. More importantly, under some special motions common for ground robots, the accelerometer bias cannot be estimated since it is unobservable. This is a problem for traditional VIO systems, which rely heavily on the IMU to estimate the unknown scale factor. Using depth information we can prevent the IMU from having too large a negative effect on the results, thus enabling our solution to provide good maps for applications such as path planning or navigation.
In this work, we fuse the RGBD camera data with both the gyroscope and accelerometer measurements of the IMU. However, in some extreme situations such as very challenging terrain, the accelerometer is susceptible to high-frequency noise and becomes unreliable. One naive option is to only fuse the RGBD camera images with the gyroscope and not use the accelerometer, since the depth camera already provides absolute scale information. However, both the accelerometer and the depth camera have their degeneration conditions, and using both also has advantages, e.g., the gravity vector estimated from the accelerometer can correct the pitch and roll angles of the camera. Thus, we keep both and are working on finding better ways to model the accelerometer, especially in extreme conditions. With a pose ground truth, such as recorded in our dataset, it is easy to obtain an accelerometer ground truth, which makes it possible to use deep learning to model the accelerometer, or even the whole IMU, more accurately. With better sensor modelling, we expect to achieve more accurate trajectory estimation and mapping results.
To build a complete navigation system, we are also working on flipper planning for our tracked robot under limited environmental perception, since RGBD cameras typically only provide one view of the environment rather than the 360 degrees of a LIDAR. 3D path planning, together with the generation of navigation maps for ground robots from the consistent point cloud information provided by our VINS-RGBD system, is a further topic.