1. Introduction
Simultaneous localization and mapping (SLAM) has developed rapidly in recent decades and is widely used in fields such as autonomous driving, augmented reality, and robotics. Vision-based methods have attracted widespread attention because cameras capture rich details of the scene. Conventional visual SLAM extracts static geometric features in the environment, such as points, lines, and planes, and achieves high-precision localization and mapping [1]. Some methods instead use the global appearance of whole images, such as global-appearance descriptors [2,3] or the gist descriptor [4], for mapping or localization; these model more information in an image than sparse features.
However, consider a service robot working in human-robot coexisting environments such as homes, museums, and offices. Commands like "Move to the TV" or "Hand over the cup on the table" contain instructions to interact with objects in the environment, which requires the robot to have a semantic understanding of where the objects are. Instead of a specific coordinate point, we want the robot to understand and reach the target area. The robot needs an object concept, including each object's semantic label, location, and roughly occupied space. It may also need orientation information, for example to wait at the side of a chair and avoid standing in the human's way. Therefore, a map that only contains geometric features of the environment falls far short of these demands.
A service robot needs to operate in its work scene for a long time, which gives rise to the study of life-long robots [5]. A life-long robot needs to adapt to illumination changes, sensor noise, and significant viewing angle changes, all of which place requirements on the robustness of the landmarks in its map. Existing visual SLAM technology relies on feature descriptors [6,7], surface textures [8,9], or the geometric position of plane and line structures [10,11,12] to solve data association. However, artificially designed descriptors struggle to adapt to significant viewing angle changes and are easily disturbed by illumination and sensor noise. Besides, the robot needs to adapt to long-term environmental changes, such as chairs, teacups, and other objects placed at will, or new furniture added or removed. Maps based on point, line, and plane features are difficult to update directly, while an object-centric map makes such updates much more convenient.
We propose that objects, as essential components of the indoor environment, have geometric properties such as position, orientation, and occupied space, and can serve as landmarks in SLAM. With the development of deep learning, object detection algorithms [13] can provide bounding boxes with semantic labels representing object areas in images. Object detection adapts well to viewing angle changes, illumination, and occlusion, and can serve as a sensor for detecting objects.
Quadrics, as higher-dimensional representations than points, lines, and planes, have a well-grounded multiple view geometry theory [14] and can compactly model the position, orientation, and shape of an object. QuadricSLAM [15] first proposed introducing quadrics as an object model into SLAM, optimizing the quadric parameters in a factor graph using the bounding boxes from object detection. However, the monocular observation model makes initializing the quadrics difficult, as it needs at least three observations with large viewing angle changes to converge. Since it is challenging to generate diversified viewing angles under a mobile robot's planar motion model, unobservable problems [16] happen easily. Moreover, QuadricSLAM leaves the data association problem unsolved.
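For reference, the projection at the core of this monocular model is standard multiple view geometry [14]: a dual quadric Q* (a 4 × 4 symmetric matrix) observed by a camera with projection matrix P projects to a dual conic C*, whose bounding box is compared against the detected bounding box:

```latex
C^{*} = P \, Q^{*} \, P^{\mathsf{T}}
```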
We propose to introduce depth data into the object-level SLAM system. RGB-D cameras are commonly used on mobile robots; they are low-cost, commercially available, provide both RGB images and depth data, and are well suited to indoor environments. Laser scanners and Lidar generate 2D or 3D point clouds but lack the visual information needed for scene detail. Combining a monocular camera with a Lidar may be a suitable solution for implementing our algorithm in outdoor environments; in this paper, we discuss the RGB-D camera only.
Based on previous research, we propose a sparse object-level SLAM using quadrics, built on the RGB-D cameras commonly used by mobile robots. We propose two types of RGB-D camera-ellipsoid observation models: a complete observation model and a partial observation model. The complete model extracts a complete ellipsoid from a single RGB-D observation based on the relationship between the object and its supporting plane. To handle the missing point cloud data caused by occlusion and invalid depth values in an RGB-D frame, when the missing data is severe we use the bounding boxes from object detection to form a partial observation model, and we propose a novel evaluation method to flexibly switch between the two observation models according to the received RGB-D data. For the data association problem, which the state-of-the-art algorithm has not yet solved, we introduce a nonparametric pose graph [17] to solve it robustly in the back end. Conventional SLAM tends to evaluate the trajectory quantitatively and ignore the map. Instead, we address mapping evaluation and propose a benchmark that considers objects' poses, shapes, and map completeness to comprehensively evaluate the mapping effect. We compared our algorithm with two state-of-the-art object-level algorithms: QuadricSLAM [15], based on a monocular quadric observation model, and a point-model object SLAM [17], based on a nonparametric pose graph using points as the object model. We ran experiments on two public datasets, ICL-NUIM [18] and TUM-RGB-D [19], and an author-collected mobile robot dataset in a home-like environment.
In summary, the contributions of our work are as follows:
We propose an object-level SLAM algorithm that uses quadrics as the object model, together with two RGB-D quadric observation models.
We propose a method to extract a complete ellipsoid from a single RGB-D frame based on the relationship between an object and its supporting plane.
We introduce a nonparametric pose graph to the quadric model to solve the semantic data association in the back end.
We have thoroughly evaluated the proposed algorithm’s effectiveness compared with two state-of-the-art object-level SLAM algorithms on two public datasets and an author-collected mobile robot dataset in a home-like environment.
4. Experiments
Datasets. To evaluate the algorithm's performance in different scenarios, we ran experiments on the ICL-NUIM dataset [18], the TUM-RGB-D dataset [19], and an author-collected dataset recorded by a mobile robot in a home-like environment, as summarized in Table 2. The public datasets use handheld camera trajectories. The ICL-NUIM trajectories cover a home scene and an office scene, and the TUM-RGB-D dataset provides a total of six scenes, including three offices, one desktop, and two human-made scenes. We recorded the real mobile robot dataset in a home-like environment; it contains more linear motion and small viewing angle observations, which test the algorithm's performance on a mobile robot more realistically.
Baselines. We chose three baselines to comprehensively evaluate the performance of our algorithm: ORB-SLAM2 [20], QuadricSLAM [15], and a point-model object SLAM [17]. ORB-SLAM2, as a state-of-the-art SLAM method based on feature points, adequately represents the performance of conventional SLAM. We use ORB-SLAM2 with loop closures disabled as ORB-VO and feed it as odometry to the three object-level SLAM algorithms to measure the improvement in trajectory accuracy after adding object landmarks. The point-model object SLAM [17] (NP for short) is a novel object-level SLAM based on a nonparametric pose graph, which uses points as the object model; we compare against it to show the advantages of the quadric model for representing objects. To verify the proposed RGB-D camera-ellipsoid observation models, we also chose the state-of-the-art QuadricSLAM as a baseline. It is a monocular SLAM algorithm based on a monocular quadric observation model. Since it does not solve data association, we give it the same association result as ours in the experiments to verify its mapping effect and localization accuracy.
All three object-level SLAM algorithms use the state-of-the-art YOLOv3 [13] to obtain object detections during the experiments. It is trained on the COCO dataset [44] and can recognize 80 everyday object categories. We select a keyframe every ten frames in each dataset; this frequency ensures sufficient observation constraints while avoiding redundant observations.
Evaluation benchmarks. To measure the SLAM algorithm's localization accuracy, we use the absolute trajectory error to compare the estimated trajectory with the ground-truth trajectory, reporting the root mean squared error (RMSE) as the metric. The construction of a semantic map is the focus of this paper. Conventional SLAM algorithms generally evaluate mapping only qualitatively or let it be reflected in the localization accuracy. To fully evaluate the effectiveness of mapping, we propose the following metrics as benchmarks. Trans(m) measures the center distance between the estimated object and the ground-truth object. Shape evaluates the object's occupied space: to remove the influence of translation and rotation, we first translate both the estimated object and the ground-truth object to the origin, align their rotation axes, and finally calculate the Jaccard distance (1 − Intersection over Union) between their circumscribed cuboids. For the objects in Table 1 with dominant directions, Rot(deg) evaluates the minimum rotation angle required to align each rotation axis of the estimated object with an axis of the ground-truth object. For a trajectory, the above metrics are averaged over all correctly instantiated objects. Since the baseline NP's point model has no concept of rotation or occupied space, we only evaluate Trans(m) for it.
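As a concrete illustration of the Shape metric, the following minimal C++ sketch computes the Jaccard distance between two circumscribed cuboids after they have been translated to the origin and rotation-aligned, so each reduces to its half-extents; the struct and function names are ours, not from the paper's implementation:

```cpp
// Minimal sketch of the Shape metric (Jaccard distance) described above.
// Assumes both cuboids have already been translated to the origin and
// rotation-aligned, so each is represented only by its half-extents.
#include <algorithm>
#include <array>

struct Cuboid {
  std::array<double, 3> half;  // half-extents along x, y, z
};

double volume(const Cuboid& c) {
  return 8.0 * c.half[0] * c.half[1] * c.half[2];
}

// Jaccard distance = 1 - IoU of two origin-centered, axis-aligned cuboids.
double shapeMetric(const Cuboid& est, const Cuboid& gt) {
  double inter = 1.0;
  for (int i = 0; i < 3; ++i)
    inter *= 2.0 * std::min(est.half[i], gt.half[i]);
  double uni = volume(est) + volume(gt) - inter;
  return 1.0 - inter / uni;
}
```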
We also pay attention to the proportion of correctly instantiated objects in the map, i.e., the instantiation success rate. We use the precision P to measure the ratio of successfully instantiated objects to the total number of instantiated objects, and the recall R to measure the ratio of successfully instantiated objects to the total number of objects in the trajectory. Since P and R trade off against each other, we use the F1 score as a comprehensive value that considers both.
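Written out explicitly, with N_suc, N_inst, and N_gt as our shorthand for the number of successfully instantiated objects, the total instantiated objects, and the total objects in the trajectory:

```latex
P = \frac{N_{\mathrm{suc}}}{N_{\mathrm{inst}}}, \qquad
R = \frac{N_{\mathrm{suc}}}{N_{\mathrm{gt}}}, \qquad
F_1 = \frac{2PR}{P + R}
```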
In the actual experiments, we count objects in the trajectory that receive at least three object detection observations. We define an estimated object as correctly estimated when its semantic label is consistent with the ground-truth object and the center distance is within 0.3 m. Only correctly estimated objects participate in the evaluation of Trans, Rot, and Shape.
The precision and recall jointly reflect the effects of mapping and data association. When the mapping effect is poor and an object lands too far from the ground truth, both precision and recall decrease. When the data association is chaotic and observations that belong to one object generate multiple instances, precision decreases; when observations of different objects are wrongly associated to one object, recall decreases. There is a trade-off between precision and recall: by adjusting the system's false positive filtering parameter, we can trade one against the other. In actual applications, false positive objects may introduce wrong loops and affect the entire system, so we set stricter conditions (such as requiring more observations per object) to obtain higher precision.
The above metrics can quantitatively reflect the effect of the overall algorithm in localization and mapping. We also qualitatively compare the mapping results with two object SLAM baselines to show the algorithm effects.
4.1. ICL-NUIM Dataset Experiment
The ICL-NUIM virtual dataset provides rendered sequences of a home scene and an office scene. The home scene includes everyday indoor objects such as sofas, chairs, TVs, tables, and vases, while the office scene includes monitors, tables, cabinets, etc. This experiment verifies mapping and localization performance in a room-scale scenario.
ICL-NUIM provides ground-truth trajectories for evaluating the estimated trajectory accuracy. Although it is a virtual dataset, both the RGB and depth images contain simulated Kinect2 camera noise, reflecting the algorithm's robustness to real noise. Since the dataset does not give the objects' poses and shapes, we manually annotated the objects' ground-truth parameters based on the dataset's ground-truth point cloud model, which is also displayed in the qualitative mapping results as a reference.
4.2. ICL-NUIM Dataset Experimental Results
Figure 10 demonstrates the effects of the complete observation model and the partial observation model on the dataset and the extraction of the Manhattan planes. In the home scene, the trajectory sometimes passes too close to objects, so objects are frequently occluded. When the observation of the sofa is complete, the algorithm extracts a well-fitting ellipsoid, as in Figure 10a. When the image boundary occludes part of the sofa, the algorithm marks the occluded part as invalid, shown in gray in Figure 10b. The table in Figure 10a,b selects the complete model and the partial model in different situations and filters the occlusion constraint planes, which are marked gray in the figure. In Figure 10b, the lower part of the chair is occluded by the table, lowering the completeness evaluation score, so the algorithm activates the partial constraint model instead. The observation model proposed in this paper thus maximizes the use of observation information rather than directly discarding it.
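The switching logic itself is simple; the following sketch shows the shape it takes, with the completeness score and threshold as illustrative placeholders rather than the paper's actual evaluation function:

```cpp
// Sketch of the observation-model switch described above. The completeness
// score and its threshold are illustrative placeholders, not the paper's
// actual evaluation function or threshold value.
enum class ObsModel { Complete, Partial };

ObsModel selectObservationModel(double completeness_score, double threshold) {
  // A well-segmented, largely unoccluded point cloud supports fitting a
  // complete ellipsoid from a single RGB-D frame; otherwise fall back to
  // bounding-box (partial) constraints so the observation is still used.
  return completeness_score >= threshold ? ObsModel::Complete
                                         : ObsModel::Partial;
}
```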
Figure 10c shows that the algorithm extracts complete ellipsoids for the TV, chairs, and vases. Figure 10d–f demonstrates the effect in the office scene. Compared with the home scene, the trajectory is farther away from the objects, which makes point cloud segmentation difficult. The figure shows that the algorithm adapts well to common office objects such as books, monitors, and keyboards, as in Figure 10e. When an object is too small to extract a point cloud, the algorithm switches to partial constraints.
The quantitative results on the ICL-NUIM dataset are shown in Table 3. Our algorithm's trajectory accuracy in the home and office scenarios improves by 6.0% and 6.6%, respectively, compared to ORB-SLAM2. We believe the improvement comes from two aspects: (1) objects such as sofas, TVs, and monitors have apparent dominant directions, providing reliable and precise constraints; (2) home and office scenes contain walls, floors, ceilings, and other low-texture regions that lower the accuracy of ORB-SLAM2, and the object constraints add information to the system. Among the other object-level baselines, NP and QuadricSLAM also achieve a small improvement over ORB-SLAM2.
The evaluation of mapping is shown in Table 4. Due to the RGB-D observation models, our algorithm achieves better translation, rotation, and shape metrics than QuadricSLAM. In terms of object orientation, the estimation based on the support relationship and the dominant direction of the object surface is significantly better than that of QuadricSLAM, which constrains the object based only on bounding boxes. In the office dataset, we achieve a lower object translation error than QuadricSLAM but a slight increase compared with NP; we attribute this to point cloud segmentation error, as the camera is far away from the objects in the ICL-Office sequences. We give a more detailed discussion in Failure Cases. In precision and recall, ours outperforms the baselines.
Table 5 shows the number of semantic objects in the dataset. We successfully initialized 73% and 67% of the objects in the home and office scenes, respectively. Compared with the baselines shown in Table 5, ours achieves better mapping precision and recall. The precision and recall in the office scene are lower than in the home scene. We think the error comes from its trajectory, which stays far from the objects, as shown in Figure 10d–f; this challenges both object detection and point cloud segmentation. Due to the small number of valid observations, some objects never reach the initialization condition. We believe that a reasonable selection of object-oriented observations in actual applications would bring further improvement.
Figure 11 shows the qualitative mapping results. The NP algorithm recovers only the center point of each object based on its point model. In contrast, the algorithm based on the quadric model recovers the orientation and occupied space besides the object center, providing more information for mobile robot navigation. QuadricSLAM is based on a monocular observation model: an ellipsoid needs more than three observations to converge, and the observations require significant viewing angle changes. In the ICL-NUIM dataset, the camera moves in a circle around the room; some objects receive too few observations with small viewing angle changes, resulting in weak convergence of the object's axis length along the observation direction, and some objects fail to appear on the map due to initialization failure. Due to the RGB-D observation model, ours recovers the center and shape of the objects better. Ours needs only one complete observation to initialize an object in the map, so it instantiates more objects than QuadricSLAM. Thanks to the statistical histogram proposed in this paper, the orientations of objects with dominant directions, such as sofas, TVs, chairs, and tables, are well estimated. In Ours-Vis, we visualize the point cloud map of the dataset as the background together with each estimated ellipsoid's circumscribing cuboid. The estimated objects fit the outer contours of the real objects, which shows that although the quadric is a rough object model, it can meet the geometric information requirements of robot navigation.
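For the Ours-Vis visualization, the circumscribing cuboid of an ellipsoid with center t, rotation R, and semi-axes (a, b, c) is the oriented box whose half-extents equal the semi-axes. A minimal sketch of generating its eight corners (names are illustrative, not from the paper's code):

```cpp
// Sketch: the eight corners of the cuboid circumscribing an ellipsoid with
// center t, rotation R, and semi-axes s = (a, b, c), i.e., the oriented box
// whose half-extents equal the semi-axes.
#include <Eigen/Dense>
#include <vector>

std::vector<Eigen::Vector3d> cuboidCorners(const Eigen::Vector3d& t,
                                           const Eigen::Matrix3d& R,
                                           const Eigen::Vector3d& s) {
  std::vector<Eigen::Vector3d> corners;
  for (int sx : {-1, 1})
    for (int sy : {-1, 1})
      for (int sz : {-1, 1})
        corners.push_back(
            t + R * Eigen::Vector3d(sx * s.x(), sy * s.y(), sz * s.z()));
  return corners;
}
```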
4.3. TUM-RGB-D Dataset Experiment
Compared with virtual datasets, real environments pose multiple challenges such as sensor noise, illumination changes, motion blur, and occlusion. The TUM-RGB-D dataset provides trajectories recorded with a handheld Kinect in real scenes. We selected eight trajectories covering different desktop and low-texture scenes for experimental evaluation. The dataset does not provide a ground-truth point cloud, so to annotate ground-truth object parameters for evaluating the mapping effect, we input the ground-truth trajectory and RGB-D data provided by the dataset into ElasticFusion [45], which produces an accurate point cloud map, and then annotate ground-truth objects based on that map. The point cloud map is also displayed in the qualitative mapping results as a reference.
4.4. TUM-RGB-D Dataset Experimental Results
Real datasets bring more challenges than virtual ones. As shown in Figure 12, larger objects such as monitors, teddy bears, and cabinets are estimated completely, while small objects such as cups and books are difficult to constrain completely due to their small point clouds. In addition, the black display produces holes in the depth data, making the object's point cloud incomplete. In these cases, the complete constraint's evaluation function helps the algorithm switch to partial constraints. Flexible switching between complete and partial constraints enables our algorithm to maximize the use of observation information.
The trajectory accuracy results are shown in Table 6. Our algorithm significantly improves trajectory accuracy in low-texture scenes such as fr3_dishes and fr3_cabinet, achieving improvements of 53% and 32% over ORB-SLAM2, respectively. It is difficult for ORB-SLAM2 to obtain enough feature points for robust pose estimation in low-texture scenes, which is where object landmarks show their superiority. However, some desktop scenes are cluttered and complicate data association, making the accuracy improvement small; moreover, ORB-SLAM2 already achieves very high accuracy there thanks to its complete loop closing, so the trajectory accuracy in those desktop scenes is not significantly improved. In general, our algorithm achieves a better accuracy improvement than NP and QuadricSLAM.
We also noticed that on some sequences, such as fr2_desk and fr3_long, ORB-VO (loop closures turned off) achieves even better accuracy than ORB-SLAM2. We attribute this to errors introduced by wrong loop closures or inaccurate loop constraints.
For the mapping effect shown in Table 7 and Figure 13, the proposed algorithm has distinct advantages over the baselines. We obtain the best Trans, Rot, and Shape metrics on most trajectories. Ours achieves 6.0 cm translation, 8.6 degrees rotation, and 60% IoU on average, improvements of 21%, 77%, and 40%, respectively, compared with QuadricSLAM. The object center accuracy improves by 22% compared with the point-model-based NP. On fr3_teddy, ours gets better rotation and shape than QuadricSLAM, while the translation is less accurate. We see two possible reasons: first, the camera is very close to the teddy bear, so the object center along the camera axis is not accurately estimated; second, the trajectory travels around a circle, which generates the large viewing angle changes the monocular QuadricSLAM needs to converge.
Table 8 counts the objects across all trajectories of the TUM-RGB-D dataset, covering a total of 100 objects. The cluttered desktop scenes are challenging for data association, as objects are very close to each other and nearby objects share the same semantic label. Ours successfully instantiated 79% of the objects and achieved 78% precision, comparable with NP. Considering that data association for quadric observations is more complicated than for the point model, the experiments prove the effectiveness and robustness of the nonparametric data association algorithm with the quadric model. There is still much room for improvement: in the experiments, we found that uncertainty in the semantic labels from object detection caused single objects to be instantiated as multiple objects. As the quadric models an object's occupied space, we can further improve association accuracy based on reasoning such as "objects cannot overlap" in future work.
Figure 13 lists each trajectory's qualitative results in detail. It is evident that the quadric model contains more object information than the point model. Ours-Vis shows that the ellipsoids constructed by our algorithm fit the objects well, especially objects with apparent dominant surface directions, such as monitors, keyboards, books, and cabinets. The object constraints serve as useful supplements to feature-based landmarks, especially in low-texture scenes.
4.5. Real Mobile Robot Experiment
To verify the algorithm's performance on a real mobile robot trajectory, we used a Turtlebot3 robot equipped with a Kinect2 to record in a real home-like environment containing 10 common indoor objects, such as televisions, sofas, beds, and potted plants, as shown in Figure 14.
Compared with the ICL-NUIM and TUM-RGB-D datasets, the trajectory of a real mobile robot is more challenging: observations of an object tend to have small viewing angle changes in the vertical and pitch directions, and there are fewer valid observations. This experiment therefore verifies the algorithm's suitability for mobile robot navigation better than the public datasets.
Since ground-truth trajectory data cannot be obtained, this experiment focuses on verifying the object-level mapping effect on the mobile robot. We build a point cloud map based on the trajectory of ORB-SLAM2 with multiple loop closures and manually annotate the poses and shapes of objects in the map as ground truth. The experiment compares the mapping effect with the two object-level SLAM baselines.
4.6. Real Mobile Robot Experimental Results
Figure 15 demonstrates the challenge of this dataset. For example, when the mobile robot passes by the couch, the limited observation angles leave the couch's point cloud incomplete. The figure shows two couch observations: the first has enough information to estimate a complete ellipsoid and then filter the bottom occlusion constraint plane; the one on the right is occluded by the previous couch and severely incomplete, so the algorithm activates the partial constraint model to maximize information usage.
According to the experimental data in Table 9, the proposed algorithm shows distinct advantages on a mobile robot. QuadricSLAM, based on the monocular observation model, struggles to adapt to the mobile robot's motion, which contains only small viewing angle observations, while ours, with an RGB-D observation model, maintains stable performance. Compared with QuadricSLAM, ours improves translation, rotation, and shape by 41%, 94%, and 63%, respectively. In the rotation evaluation, large objects such as televisions, sofas, beds, and cabinets have their rotation angles successfully estimated with a high accuracy of 2.8 degrees, which can effectively help the robot determine the semantic orientation of an object. Compared with the NP algorithm, ours improves translation by 46% due to better handling of occlusions.
Evaluating the map's completeness, we successfully recovered all the objects recognized by the object detector, as shown in Table 10. Due to the efficient and robust data association, the precision reached 91%, far exceeding the two object-level baselines. However, the sofa is recognized as a chair by the object detector several times under small viewing angles, as in Figure 16, causing the small sofa to be instantiated as one extra object. This problem also appeared in the public datasets; we discuss it further in Failure Cases.
Figure 17 visualizes the comparison between the estimated objects and the ground-truth objects; ours fits the ground-truth objects better. When navigating, a semantic robot can use this map to understand the scene and execute semantic commands such as "Move to the TV." Navigation based on the semantic map will be our future direction.
Table 11 shows the per-category results on the real-home dataset when we reject all partial observations, ignoring bounding boxes whose edges lie within 30 pixels of the image border. When the mobile robot travels around the room, its viewing angle produces many partial observations, and rejecting them all directly causes a large loss of information; in particular, all the observations of the couches are partial. Overall, introducing partial constraints significantly improves the object translation accuracy.
4.7. System Modules Analysis
Table 12 shows the orientation estimation results for three representative types of objects across the three datasets. Objects with a dominant direction, such as the couch, TV, monitor, and cabinet, show high accuracy, while flat objects with small extent along the Z-axis, such as books and keyboards, get lower accuracy. After optimization, the accuracy of all objects increases. For indoor mobile robot navigation, small objects such as books and keyboards are easily moved by humans; we pay more attention to objects that are large and stable enough to serve as reliable landmarks.
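The orientation histogram is not spelled out here, but a plausible minimal sketch is to fold the yaw angles of near-vertical surface normals modulo 90 degrees (exploiting the four-fold symmetry of box-like objects) and take the peak bin as the dominant direction; the bin width, the folding, and all names below are our assumptions, not the paper's exact formulation:

```cpp
// Minimal sketch of a yaw-angle histogram for estimating an object's
// dominant horizontal direction from its surface normals, in the spirit of
// the paper's statistical histogram. All details are illustrative.
#include <algorithm>
#include <cmath>
#include <iterator>
#include <vector>
#include <Eigen/Dense>

double dominantYaw(const std::vector<Eigen::Vector3d>& normals,
                   int num_bins = 90) {  // 1-degree bins over [0, 90)
  std::vector<int> hist(num_bins, 0);
  for (const auto& n : normals) {
    if (std::abs(n.z()) > 0.8) continue;  // skip near-horizontal surfaces
    double yaw = std::atan2(n.y(), n.x());
    // Fold the 4-fold symmetry of box-like objects into [0, 90) degrees.
    double deg = std::fmod(yaw * 180.0 / M_PI + 360.0, 90.0);
    hist[static_cast<int>(deg / (90.0 / num_bins)) % num_bins] += 1;
  }
  int peak = static_cast<int>(std::distance(
      hist.begin(), std::max_element(hist.begin(), hist.end())));
  return peak * (90.0 / num_bins);  // dominant yaw in degrees, modulo 90
}
```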
We evaluate the number of valid and invalid constraint planes on a representative trajectory of each dataset in Table 13. Invalid constraint planes are caused by occlusion at the image edges; we use precision and recall to evaluate the ability to detect them. After filtering, the proportion of invalid planes is reduced to around 0.3–2.5%. For the remaining outliers, we use the Huber loss in the optimization to reduce their influence.
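Attaching a Huber kernel to an observation edge in g2o takes one call per edge; a minimal sketch (the helper function name and delta value are illustrative):

```cpp
// Sketch of attaching a Huber robust kernel to an observation edge in g2o,
// as used to down-weight the remaining invalid constraint planes.
#include <g2o/core/optimizable_graph.h>
#include <g2o/core/robust_kernel_impl.h>

void addRobustKernel(g2o::OptimizableGraph::Edge* edge, double delta = 1.0) {
  auto* kernel = new g2o::RobustKernelHuber();
  kernel->setDelta(delta);        // residuals beyond delta grow linearly
  edge->setRobustKernel(kernel);  // instead of quadratically
}
```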
4.8. Failure Cases
The experiments revealed some situations that are difficult to handle, which guide our future work. The errors mainly come from two aspects:
Wrong point cloud segmentation. This error mainly comes from the point cloud processing of the complete constraint model. For example, because potted plants have very thin structures, Euclidean clustering captures only part of the plant (see the sketch below). Due to the simple supporting-plane judgment, some objects on the ground are mistaken as located on the desktop, so their bottom point clouds are filtered out, introducing errors. Most point cloud segmentation problems are turned into partial constraints through the judgment of the evaluation function, but the wrong cases that pass this check introduce errors into the algorithm.
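The clustering step uses PCL's Euclidean cluster extraction; the sketch below shows why thin structures fragment: points farther apart than the cluster tolerance are split into separate clusters. Parameter values are illustrative:

```cpp
// Sketch of Euclidean clustering with PCL. Thin plant structures whose
// points are sparser than the cluster tolerance get fragmented, which is
// the failure mode described above. Parameter values are illustrative.
#include <pcl/point_types.h>
#include <pcl/search/kdtree.h>
#include <pcl/segmentation/extract_clusters.h>
#include <vector>

std::vector<pcl::PointIndices> cluster(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud) {
  auto tree = pcl::search::KdTree<pcl::PointXYZ>::Ptr(
      new pcl::search::KdTree<pcl::PointXYZ>);
  tree->setInputCloud(cloud);

  pcl::EuclideanClusterExtraction<pcl::PointXYZ> ec;
  ec.setClusterTolerance(0.02);  // 2 cm; thin stems easily exceed this gap
  ec.setMinClusterSize(100);
  ec.setMaxClusterSize(250000);
  ec.setSearchMethod(tree);
  ec.setInputCloud(cloud);

  std::vector<pcl::PointIndices> clusters;
  ec.extract(clusters);
  return clusters;
}
```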
Wrong object detection. We used the YOLOv3 object detector without targeted training for the indoor environment, so some objects are confused during detection. For example, the object detector treats the right half of the cabinet in the ICL-Office scene as an object most of the time, and small objects produce semantic confusion, resulting in low precision in that scene; the monitors are also wrongly detected as laptops several times. When there are many wrong labels, the data association becomes confused, generating extra instances and lowering the precision.
These problems inspire our future work. First, introduce a more robust object-support relationship judgment module, e.g., judging the relationship between the object and the supporting plane via global reasoning. Second, introduce relationships between objects, such as "objects cannot overlap", to obtain a more reasonable data association. Finally, train the object detector on an indoor scene dataset to improve detection accuracy and robustness.
4.9. Computation Analysis
We implemented the proposed algorithm in C++, using the g2o library [46] for graph optimization and PCL [47] for point cloud processing. We ran the algorithm on a desktop PC with an AMD Ryzen 5 3600 3.6 GHz CPU, 16 GB RAM, and Ubuntu 16.04, as in Table 14. We present the time consumption on the real-robot dataset in Table 15.
It is worth mentioning that the back end runs in parallel with the front end, which meets mobile robots' requirements for real-time map building and path planning. Adjusting the number of iterations affects the convergence of object data associations, camera poses, and object parameters; we ran five iterations in the experiments. All modules run on a CPU except the object detection. The recent emergence of lightweight neural networks [48] allows object detection to achieve high frame rates on a CPU, which will make it possible to run the proposed algorithm entirely on a CPU in the future.
4.10. Memory Usage
We summarize the memory usage compared with two state-of-the-art systems in Table 16 and the map storage in Table 17. Each object in our map comprises an ellipsoid with nine parameters and a semantic label with one parameter; we store the values as floats and the label as an unsigned char. For a room-scale mobile robot application on the real-robot dataset, our memory usage is only 1.2 GB and our map is only 1.63 KB, showing significant advantages compared with dense methods.
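The per-object storage implied by these numbers is nine floats plus one byte, i.e., 37 bytes per object before any serialization overhead; a minimal sketch with illustrative field names:

```cpp
// Sketch of the per-object storage implied above: nine float parameters for
// the ellipsoid plus one unsigned char for the semantic label, 37 bytes per
// object with packing. Field names are illustrative.
#include <cstdint>

#pragma pack(push, 1)
struct ObjectRecord {
  float center[3];     // x, y, z
  float rotation[3];   // e.g., roll, pitch, yaw
  float semi_axes[3];  // a, b, c
  std::uint8_t label;  // semantic class id
};                     // sizeof(ObjectRecord) == 37 with packing
#pragma pack(pop)
```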
5. Conclusions
This paper proposes introducing the artificial objects of indoor environments into SLAM as robust landmarks and builds an object-oriented map, which extends traditional SLAM's ability to understand indoor scenes. We propose an object-level semantic SLAM algorithm based on RGB-D data that uses quadrics as the object model, compactly representing an object's translation, orientation, and occupied space. Compared with the state-of-the-art monocular quadric SLAM systems, we propose two types of camera-ellipsoid observation models based on RGB-D data. The complete observation model uses the spatial relationship between an object and its supporting plane to estimate the ellipsoid parameters from a single frame of RGB-D data. To address the point cloud missing and occlusion problems, we propose a partial observation model and an evaluation function to flexibly switch between the two types of models.
The state-of-the-art quadric-based SLAM leaves the data association problem unsolved. This paper introduces the nonparametric pose graph and integrates it with the proposed RGB-D observation model to robustly solve data association in the back end. On the ICL-NUIM and TUM-RGB-D datasets and a real mobile robot dataset recorded in a home-like scene, we demonstrated the quadric model's advantages, improving localization accuracy and mapping quality compared with two state-of-the-art object SLAM algorithms. Semantic navigation based on the object-level map, a more robust module for finding objects' supporting planes, and global scene understanding are valuable directions for future work.