1. Introduction
Simultaneous localization and mapping (SLAM) is a critical problem in robotics and computer vision, which involves building a map of an unknown environment while simultaneously estimating the robot’s location within the map [
1,
2,
3]. In recent years, RGB-D cameras have emerged as a popular sensing modality for SLAM systems, as they provide both color and depth information of the environment (
Figure 1).
Scene classification in SLAM models that rely on the use of RGB-D cameras is a challenging task due to a number of factors [
4,
5]. Conventional image classification techniques like oriented FAST and rotated BRIEF (ORB) [
6] and scale-invariant feature transform) [
7] have been employed for scene classification within the SLAM context, utilizing only the 26 RGB channels. Yet, they do not consider depth. To address this problem, we propose a new method for scene classification in SLAM using the boundary object function (BOF) descriptor [
8] on RGB-D points.
The BOF descriptor is a powerful technique for feature extraction and classification in computer vision. It converts the distance from the centroid to the points in the border of the object for each object found in a scene. The obtained distances are then used as the basis to classify the scene.
From an RGB-D camera, we extract points and fit them into orthogonal layers that are orthogonal to the camera plane. From each layer, we extract the boundaries of the detected objects, such as furniture, ceilings, walls, doors, etc. The extracted features are then classified using a machine learning method.
In this paper, we propose a new method for scene classification using the BOF descriptor on RGB-D points. Our method takes advantage of the RGB-D information provided by the camera and provides more robust and discriminative features for 3D scenes. We also use the concept of bag of visual words that are classified by an SVM, which allows us to handle complex scenes with high accuracy. Our experimental results demonstrate the effectiveness of the proposed method in terms of accuracy and robustness in different indoor scenes. We also provide experimental results to demonstrate the effectiveness of the proposed method in terms of accuracy and robustness in different indoor scenes.
The rest of the paper is organized as follows:
Section 2 presents an overview of existing studies and contrasts them with the unique contributions of our research.
Section 3 presents the proposed method in detail.
Section 4 describes the experimental setup and presents the results. Finally, in
Section 5, we present some conclusions and provide some directions for future work.
2. Related Work
Scene classification using RGB-D cameras is an active area of research in robotics and computer vision. In this section, we provide an overview of the related work in this field as well as a panoramic view of the state of the art.
Traditional image classification methods such as ORB or SIFT have been used for scene classification in SLAM systems. These methods rely on 2D image features and may not be sufficient for classifying 3D scenes accurately. In recent years, several methods have been proposed to address this problem [
9].
A study of an RGB-D SLAM system for indoor dynamic environments used adaptive semantic segmentation tracking to improve localization accuracy and real-time performance, achieving a 90.57% accuracy increase over ORB-SLAM2 and creating a 3D semantic map for enhanced robot navigation [
10].
Also, there is a pressing need to run scene or object detection algorithms on mobile objects such as robots and autonomous cars, where it is necessary to have lightweight algorithms that consume few computational resources (memory, processing time, and power). This is why algorithms based on that precept rescue simple feature extenders, as in [
11]; the authors presented the modified R-ratio with the Viola–Jones classification method (MRVJCM) for efficient video retrieval, achieving 98% accuracy by automating image query recognition and optimizing system memory usage.
The BOF descriptor has been widely applied in several contexts. It was introduced in [
8], where the descriptors allowed an accurate recognition of assembly pieces, including several shapes such as squares and circles; at the time, the orientation was determined by the shadow that the pieces projected. The images from which the BOF descriptors were obtained were taken from a camera located at the top of an assembly facility, which facilitated the detection of objects. A neural network, fuzzy ARTMAP, conducted the classification stage of the pieces, and the results were highly precise for all combinations. In a more recent application [
12], it was applied in a technique to identify objects from several viewing perspectives. A condensed convolutional neural network model, inspired by LENET-5, was employed for the classification phase. This approach was implemented on an FPGA.
The BOF consists of a numeric vector used to describe the shape of an object. It differs from local feature extraction descriptors like SIFT, SURF, and ORB in that it describes the shape of an object but not the neighborhood of a feature point.
The steps to obtain a BOF descriptor are as follows:
Apply an object segmentation procedure.
Detect the contour and centroid of the object.
Quantize the contour into
n points, where
n is the size of the descriptor. With
, the test guarantees a good balance between accuracy and computer performance [
13].
Obtain the distances from the quantized contour to the centroid.
Concatenate the distances in counterclockwise order of appearance.
Normalize the vector (the components are divided between the maximum components).
In recent years, the application of neural networks, in particular those with a deep learning architecture, in the field of scene classification has witnessed a significant increase. Heikel and Espinosa-Leal [
14] implemented a YOLO-based object detector that gives a descriptor of each image this was put in Tf-idf representation; finally, the information was classified using random forest. The pipeline is similar to ours, with the difference being that we use a support vector machine for classification and BOF as the descriptor. Another deep learning approach is an autonomous trajectory planning method for robots to clean surfaces using RGB-D semantic segmentation, particularly employing the double attention fusion net (DAFNet), presented in [
15]. This technique enhances indoor object segmentation and, through various processes, generates a smooth and continuous trajectory for the robotic arm, proving effective in surface cleaning tasks.
In Ref. [
16], the authors combined deep learning and RGB-D sequences to take advantage of all the RGB-D information provided by Kinect. Their efforts included fussing the color and depth information with three techniques, namely, early, mid, and late fusion. A ConvNet-based method was used to extract descriptors due to the capacity of generalization that this type of structure allows. The results were significantly better in indoor scenarios than those obtained by the bag of visual words (BOVW) approach. The main drawback of the this ConvNet-based system is linked to the difficulty of its implementation in real-time situations due to the its high demand for computing power.
Semantic information is an important feature in interactive robot assistants. In Yuan et al. [
17], the authors took advantage of the semantic segmentation provided by the Panoptic feature pyramid networks. This incorporation allows the system to create a semantic codebook, which divides the words in dynamic and static tokens. The rationale behind this approach is that the static words are more meaningful, whereas the dynamic ones have less value. For example, the word
person has a value of zero because people cannot describe a place. Their descriptor is built upon a semantic graph, which also serves to define a similarity function.
Finally, a model in which the use of residual neural networks to optimize traffic sensor placement and a subsequent predict of the network-wide origin-to-destination flows is presented in [
18]. The proposed deep learning model offers high prediction accuracy, relying on fewer sensors, as demonstrated on the Sioux Falls network.
3. Materials and Methods
In this section, we describe the materials and methods used in our proposed method for scene classification in SLAM using the BOF descriptor on RGB-D points.
3.1. Dataset and Platform
We based our experiments on three datasets (
Table 1) for the training and testing stages: the Microsoft 7-Scenes [
19], SUN RGB-D, and OfficeBot TourPath (OBTP) datasets, adhering to the train–test split as prescribed in the original publication [
20]. The three datasets furnish color and depth information about the environment, a crucial requirement for our proposed method.
The results were procured using an Jetson Nano single-board computer (NVIDIA Corporation, Santa Clara, CA, USA) running on Ubuntu 18.04.6 LTS. The system specifications include a CPU clocked at 1.479 GHz and 4 GB RAM.
3.2. BOF Feature Extraction from RGB-D Images
In this method, we use only depth images to extract BOF features by following these steps:
The depth image is transformed into a point cloud, which is a set of 3D points representing the position of the objects in space captured by the image.
The point cloud is divided into layers. The number of layers is a hyperparameter
L that is set before extracting the BOF features. We select an axis determined by a unitary vector
v and project the points to
v.
After that, we obtain the minimum
and the maximum
of these projections and divide the interval
into
L subintervals of length
. Finally, using the function
, which rounds a float to an integer, an index
is assigned to each point
p by the following equation:
All points contained within a layer are projected to a plane perpendicular to the roll axis of the camera. In this manner, points are represented in the form of for further analysis.
For each layer obtained in the previous step, a binary image of resolution
is generated, consisting of ones in the grids containing at least one point in space and zeros where there is no point. To determine if the pixel of the new binary image with index
is 0 or 1, we use an index function
that assigns the two-dimensional integer formed by two coordinates to each projected point
in a layer, which results form the rounding function
according to:
where
and
are the minimums of the projections in the canonical axes
x and
y,
and
. Once the index
is determined, the binary image is constructed following the next rule: given a pixel of the binary image
, if there exists
such that
, we set the value of the pixel
to 1; otherwise, the value of the pixel
is set to 0.
The binary image is smoothed to eliminate the gaps caused by the low resolution of the point cloud. Smoothing is achieved using a closure morphological operation.
For each binary image, closed contours are found.
For each contour, the BOF descriptor is extracted following the steps discussed in
Section 2.
All extracted BOF descriptors are stacked and associated to the frame.
Figure 2 illustrates the aforementioned process. It is important to note that only the depth image is take in into account, and the RGB image is kept aside. In
Figure 2c, the multiple layers display objects highlighted with 1’s. A filter smooths the binary images to minimize noise. In
Figure 2d, the Boundary Object Function is extracted solely from objects where the contour comprises a minimum of 1% of the total area.
3.3. Scene Classification
As a complement to autonomous navigation, scene recognition [
22] endows an intelligent system with the ability to localize itself and understand the context of its surroundings. By recognizing the place where it is located, the intelligent system can adapt its actions to achieve its goals, e.g., for the case of a mobile robot, to move from one point to another or to plan based on location-derived information.
For this purpose, a scene recognition system based on traditional methodologies is proposed. This scheme is presented in
Figure 3.
For the feature extraction stage, the traditional methodologies include algorithms such as SIFT, SURF, and ORB. In the feature transformation stage, BoVW approaches are commonly applied. For the classification stage, models such as support vector machine (SVM), random Fforest, naïve Bayes, or k-nearest neighbors (kNN) are commonly applied.
The contribution of this work involves following the BOF perspective as a feature extraction method. The reason for this is the relatively low computational demand required for obtaining of this descriptor compared with that of other commonly used local feature extraction schemes, such as the mentioned SIFT, SURF, and ORB methodologies.
SLAM algorithms need a loop closure mechanisms to ensure the correct generation of the map, detecting revisited places in order to add consistency and robustness. When the main sensor of the robot is a camera, it is referred to as appearance-based loop closure detection. In [
23], these mechanisms belong to two categories, namely, offline and online. The former, to which our BOW approach belongs, needs a dictionary or database with information trained previously. Bag of binary words [
24] is one of the most important exponents of the offline type. It was used, for example, in ORB-SLAM [
25] and has been tested more recently in [
26].
Given a training set of BOF descriptors, a codebook needs to be created. The codebook is an array of centroids
. To represent a BOF descriptor
as a word, we calculate the distance of each component
with each centroid
and select the closest. So, the vector
is formed. Finally, the number
counts the times that the centroid
appears in
; the result is a
k length vector
, which represents the frequency that each word
has in the BOF descriptor. All this process is summarized in the map:
3.4. Loop-Closing Detection
We followed the method described in [
27] to perform loop closing, under two constraints: first, we assume that the point clouds of visited frames are already stored; second, we use a simple bag of words dictionary without a tree structure. In other words, we apply k-means and not hierarchical k-means for its creation in order to keep computational complexity as low as possible.
The BOW descriptor obtained with Equation (
4) needs to be described in Tf-idf representation with the following map:
The vector of weights
is obtained in the training phase by:
where
is the number of BOF descriptors in the training set, and
counts those that contain the word
.
The applied distance in the whole process is the
-norm. The justification of relying on this metric comes from the results reported in [
28], where it outperformed normalization. The BOW vector associated to the frames
i and
N are compared using the function:
where
N represents the label of the current frame. In order to normalize this function and given that the object of study is sequences of images, the following variation is used as a similitude score:
where
is an integer interval such that the frame
passes one second before the current frame
N.
If is less than , the frame is discarded; otherwise, the frame that maximizes is inspected. A time consistency check is carried out for this maximum, which consists of the replication of these steps for frames , validating that the corresponding maxima are indeed closed enough. Two thresholds and are selected. If , the frame is discarded. If , the frame is accepted as a loop-closing one. However, if is in the range (, ), a geometric verification using RANSAC over the point clouds corresponding to the frames i and N is needed.
3.5. Experimental Setup
We conducted experiments on a dataset of indoor scenes captured using an RGB-D camera. The dataset contains several scenes with different illumination conditions as well as distinct object configurations. We compared the performance of our proposed method with that of traditional image classification methods such as SIFT and GIST [
29].
In the context of scene classification, we trained two models: the first one relies on BOF for the feature extraction stage, whereas the second is based on SIFT. Both models use BoVW and SVM for feature transformation and classification, respectively. For the purpose of this paper, we call the first method BOF-BoVW and the second SIFT-BoVW.
For the experiments, we used the Microsoft 7-Scenes dataset [
19], which consists of RGB-D sequences (recordings) in 7 different zones. Each zone has different sequences. The zones are Chess, Fire, Heads, Office, Pumpkin, RedKitchen, and Stairs.
Also, we performed tests sing the SUN RGB-D dataset with the same train–test split as in the original publication [
20]. The dataset consists of several thousands of images distributed along 19 labeled scenes; the split was chosen carefully by the authors in order to avoid the sparsity of the frames and allow a correct generalization (
Figure 4). Originally this dataset was tested using a GIST descriptor linked to a SVM. The stack of the GIST descriptors applied to RGB and depth improved the results. The best results were achieved with the use of the Places-CNN descriptor and an RBF-SVM.
We were interested in comparing our model using this dataset because it is based on an an SVM approach. This provided a direct metric to compare our results with the existing ones.
In order to prove the effectiveness of the scene classification in real conditions, we tested the BoVW-BOF method with our own robot platform, which has a camera (RGB-D realsense model D45)5. For the training phase, we recorded 7 scenes in our laboratory: office_1, office_2, laboratory_1, corridor_1, corridor_2, corridor_3, and bathrooms. We recorded the depth and RGB images and collected them to create the OfficeBot TourPath (OBTP) dataset.
For the loop detection experiments, we concentrated on the chess sequences in the Microsoft 7-Scenes dataset. We followed the split for the training and testing sets as described in [
30]. For the training set, we created a code book of 1024 words based on the BOFs descriptor extracted from the sequences; for testing, we used the third sequence. Then, we put each word in a TF-IDF representation and compared the similarity of the current frame with the one
N frames behind, as stated in
Section 3.4. After temporary verification, we fixed the thresholds
and
as in [
27] in order to determine if a loop candidate is approved or discarded.
In the next list, we describe the parameters that modulate the behavior of the algorithm:
: Upper threshold that allows us to determine if a loop is accepted.
: Lower threshold that allows us to determine if a loop is discarded.
N: If the current keyframe is in position M, then the keyframe is used to calculate the normalization factor .
: The threshold that the normalizer has to exceed in order to be accepted.
TC req: Number of keyframes adjacent to the current frame that are required to declare it as valid in the temporary consistency check.
TC: Number of keyframes in which the temporary consistency check runs.
: Threshold that represents the maximum difference allowed between the index that maximizes the normalized scores of the frames adjacent to the current one.
keyframes: The number of frames that are considered in evaluation. It is the result of a homogeneous division of the number of total frames.
The next list contains the values returned as output by the algorithm [
Candidates: Number of keyframes that pass the upper threshold.
Approved: Number of candidates that pass the time consistency check.
Discarded: Number of keyframes that stay below the threshold.
4. Results
4.1. Results for Scene Classification on Microsoft 7-Scenes Datasets
We first evaluated BOF-BoVW and SIFT-BoVW using the hold-out method, with 75% training data and 25% test data, from a single sequence per class.
In the classification stage and using cross-validation, we found that the optimal classifier parameters are
with an RBF kernel for
BOF-BoVW and
with a linear kernel for
BOF-BoVW.
Figure 5 shows the confusion matrices resulting for the parameters mentioned.
Table 2 shows that we observed an accuracy of 99% with our proposed method, almost reaching the accuracy of SIFT-BoVW, which has just one mismatching frame. This scenario has applications for a robot that navigates in the same building.
In the next stage,
BOF-BoVW was evaluated using a sequence of frames different from the ones present in the training set as testing data. This scenario is applicable to robots that navigate in unknown buildings. In
Figure 6, we show that our method decays to 34% accuracy, where the heads scene is the one with the best performance metrics. It can be observed that the three blocks in the central diagonal of
Figure 6a are consistent. Conversely, SIFT-BoVW maintains high accuracy, where the decrease is justified by the unbalanced stairs class. From this, the diagonal in
Figure 6b only fails in the last square. The
Table 3 shows an accuracy of 34% for
BOF-BoVW and 85% for
SIFT-BoVW.
4.2. Results for the SUN RGB-D Dataset
The SUN RGB-D dataset allows testing the generalization capabilities of SVM models. To achieve the best results with the BOF descriptor, we set the number of layers to 20 in the point cloud. Each layer produces a binary image of 300 × 300 pixels, from which we obtain the contours. We requested that the area of the contour was at least one percent of the total binary image area. With this configuration, 164,972 vectors were obtained, leading to 5285 BOF descriptors, one for each frame of the training set.
The SUN RGB-D dataset is known for presenting several challenges, a fact that is confirmed by the confusion matrices displayed in
Figure 7. It can be observed that the matrices are disperse, and just some squares of the diagonal are colored, indicating the difficulty of achieving low error. Along this line, the class of furniture store objects is the one with higgest F1 score. In
Table 4, we show that in some scenes, such as some from the study space class, the SIFT-BoVW model achieves better results, whereas in others classes, such as the rest space one, the BOF-BoVW model obtains the best results. In terms of expected accuracy, both methods offer similar results.
Originally, in [
20], the SUNRGB dataset was evaluated with a configuration of the GIST descriptor and an SVM as the classifier. In addition, the color and deep information were included in the evaluation. In
Table 5, we compare our implementations with the traditional approaches. Of particular interest is the observation that BOF-BovW performs better than GIST with either RGB or depth information alone. From this, we conjecture that the use of both color and depth information is needed to improve the GIST performance.
The deep analysis of the performance of our model was based on the impact of the number of BOF descriptors per frame. We varied it from three to twenty in order to examine the changes in the classification metrics.
4.3. Results for Real Usage Conditions
We tested the the BoVW-BOF approach with our mobile robot platform; we built our own OBTP dataset (
Table 1). For the training phase, we considered seven scenes; a total of 31,000 BOF descriptors were extracted from 1570 depth images. In the testing phase, the robot was launched on a different day with the same illumination conditions, and 920 frames were evaluated.
Figure 8 shows two different confusion matrices. We noticed that the corridors were similar scenes in terms of the absence of characteristic objects. Also, the office_2 scenes had less training frames than the rest. So, in
Figure 8b, we restrict our scenes to the those determinants resulting in an improvement in accuracy of up to 86% (
Table 6).
In order to check the efficiency and performance of the described method, an ROC curve was generated (
Figure 9) on the OBTP dataset. It can be observed that most of the scenes are satisfactorily classified, except for the corridor_1 scene. The main reason for this discrepancy is the significant imbalance in the number of frames in that scene compared to the remaining ones. For the latter scenes, the area under the curve (AUC) is above 0.92.
4.4. Results for Time Performance
The main objective of using BOF over SIFT is to reduce the computational complexity associated with the whole process, which includes memory (hardware) and processing time, to enable real-time recognition on single-board computers. To compare the consumption of computational resources, a comparison is made between the use of BOF and SIFT descriptors.
Our results are presented from two aspects: CPU usage time and a stage that we call “real time”. The CPU time combines user and kernel times and accounts for each core in multi-core processors. The real-time aspect refers to the total elapsed time from the start to the end of the process, not considering individual core times. In multi-core processors, these measurements can differ, especially if processes run in parallel, which may influence the actual time in order to make it shorter than the CPU time.
The processes evaluated in
Table 7 are
Extraction of descriptors from a frame, which is the average value obtained from 10 runs on the same frame is considered as the relevant quantity.
Extraction of descriptors from multiple frames, where 1000 frames were processed.
Generation of a visual word vocabulary, consisting of 1024 words. For BOF-BoVW, a three-layer case was computed on 34,000 samples. BOF-BoVW 20-layer case was computed over 190,000 samples, and the SIFT-BoVW case was computed on 150,000 samples.
Further transformation to a BoVW TF-EDF representation using the 1024 words dictionary.
Training of the model using pre-defined parameters. SVM was trained using the parameters previously mentioned.
Classification: quantification of the classification performance over 1625 samples using the SVM model trained in point 5.
Computing the total representation time. This is the sum of the results from points 2 and 4.
Computing of the total offline phase. It is defined as the sum of the results from points 3 and 5.
Computing of the total online phase, which consists of the sum of the results from points 2, 4, and 6.
Table 7.
Comparison of time performance results.
Table 7.
Comparison of time performance results.
Process No. | BOF-BoVW 3 Layers | BOF-BoVW 20 Layers | SIFT-BoVW |
---|
CPU (s)
|
Real (s)
|
CPU (s)
|
Real (s)
|
CPU (s)
|
Real (s)
|
---|
1 | 0.31 | 0.30 | 0.46 | 0.39 | 0.32 | 0.26 |
2 | 294.87 | 295.68 | 392.12 | 353.83 | 486.92 | 292.70 |
3 | 73.00 | 18.39 | 9710.73 | 2871.72 | 3962.71 | 1354.93 |
4 | 107.59 | 27.93 | 148.17 | 38.52 | 533.62 | 137.86 |
5 | 93.78 | 94.37 | 100.78 | 100.69 | 16.23 | 16.33 |
6 | 27.81 | 27.97 | 31.17 | 31.36 | 6.20 | 6.20 |
7 | 402.46 | 323.62 | 540.30 | 392.35 | 1020.54 | 430.55 |
8 | 166.78 | 112.76 | 9811.51 | 2972.42 | 3978.94 | 1371.25 |
9 | 430.27 | 351.59 | 571.47 | 423.71 | 1026.74 | 436.75 |
In order to better understand the comparison of BOF-BoVW and SIFT-BoVW, we present the percentage increases for the listed cases
Table 8. Increases are computed using the equation
, where
I is the percentage increase,
the final value, and
the initial value.
Increase B-S 3 means the percentage increase using BOF-BoVW with 3 layers as the initial value and SIFT-BoVW as the final value. The same reasoning is followed for
Increase B-S 20, but relying on a BOF-BoVW with 20 layers as the initial value.
In terms of memory usage, the results for the sequence 01 train split of the Microsoft 7-Scenes dataset are shown in
Table 9. The most relevant result can be observed in the first row, where SIFT descriptors need 1.9 GB. However, BOF descriptors with three layers need 49.4 MB, which translates into an increase of 3746% of storage needed. Using our heaviest 20 layers BOF representation leads to an increase of 593%. Maintaining descriptors over time is important if an implementation in a SLAM system is sought, due to the importance of reusing information from previous frames already visited, in order to speed up tasks such as loop closure detection. We observed that the BOW TF-IDF representations in both descriptors is almost identical, which can be explained by the fact that the model mainly depends on the codebook and the numbers of words in it. The other files that need to be stored are the codebook and model trained, and these remain in the megabyte scale in both cases.
4.5. Results of Loop Detection
In
Table 10, we display the results of the loop closure implementation. If we modify the parameter corresponding to the temporary consistency check (
, TC req, TC), the approved rate is doubled, as shown in
Figure 10b, which is contrasted with what is displayed in
Figure 10a. The change in thresholds
and
does not have a significant impact on the discarded rate parameter, and just seven loops more are approved in
Figure 10b,c.
Finally, we can also augment the gap between keyframes, which leads to a gain in processing speed, at the cost of reduced resolution. The lack of candidates and approved frames in
Figure 10e is explained by the fact that we set the parameter keyframe to every two normal frames instead of one, and we did not adjust the remaining parameters to stay proportional with this new distribution of keyframes. This is displayed in
Figure 10f. Despite having a lower value for the keyframes parameter, we achieved similar rates of approval and discard by means of tuning the relevant parameters.
The manner in which we implemented the loop-closing detection procedures is derived from counting with a bag of visual words representation for the scene classification phase. However, the fern approach in [
30] seems to be adaptable to our descriptor in the following way: each BOF descriptor has 180 entries, so we can set 180 thresholds
uniformly sampled and create a new binary vector, which contains a one if the corresponding BOF entry passes the threshold and zero otherwise.
In order to merge the results obtained in both parts, the classification and loop detection stages, a dataset needs to meet two requirements: to be divided in scenes and to contain a path that passes by those scenes. In this way, a semantic verification step immediately before the time consistency stage can be implemented in order to use this semantic information.
5. Conclusions
Scene recognition and classification are open problems in the robotics, vision, and pattern recognition fields. In this paper, we described a novel method able to cope with complex scenes at the time that keeps computational complexity low. Our method achieves performance comparable to that of more demanding architectures. The recognition and classification model we developed achieves performance that is comparable to that of other relevant models in a time with a significantly lower computing demand.
The main purpose of the BOF descriptors is to be lightweight, that is, to reduce computational complexity in both space (memory use and hardware resources) and processing time. Using a relatively shallow architecture of only three layers and configuring the online processes (descriptor extraction, BoW representation, and classification) took 596 s less than the one with SIFT and was 2.38× faster, which is an important result because of the calculations that the onboard machine of the robot must complete. Furthermore, the offline processes (codebook generation and model training) also are more than 20 times faster in CPU time with the three-layer configuration. This opens the possibility of considering the implementation of a training phase on board to adjust the models trained offline.
The best scene recognition results were achieved with a configuration of 20 layers per frame. The results are comparable to those obtained with SIFT-based models, at least on the the two datasets we considered here. Also, we implemented an efficient completed loop-closing module. Furthermore, our method was able to rely on semantic information derived from the scenes. A particularly relevant next step in our research is the implementation of this module in a lightweight semantic SLAM system.
We presented the results of our approach in several tables and figures in
Section 4, which are comparable to those obtained by more popular methods. At the same time, the significantly less computation needed by our approach was proven in the corresponding analyses. We consider this latter attribute to be one of the main contributions of our work.
An additional advantage of our method is that the number of descriptors and their size take up less space in the CPU’s RAM. While SIFT-BoVW uses 1.9 GB, BOF-BoVW (20 layers) requires only 274 MB (
Table 9). On some small-form-factor computers, it would be challenging to load the operating system and run the algorithm with SIFT; however, using the BOF descriptor for scene classification overcomes this issue. Remember that the longer the autonomous navigation journey, the more descriptors are needed for both SIFT and BOF.
Future Work
A natural follow-up experiment involves testing the entire SLAM algorithm on the two datasets descxribed in this paper. Moreover, our model can be embedded in a robot with omnidirectional wheels to confirm that the point cloud capture remains unaffected by potential camera warping. Given the robot’s primarily smooth horizontal movement and the camera’s fixed position, the point cloud is anticipated to maintain a consistent distance from the floor to the sensor without any tilt.
Currently, classification methods using deep learning are very competitive tools and reach extensive generalization ranges. So, we will seek to move away from classification using SVM and opt for a deep learning model that classifies the BOFs of each layer of each frame of each scene. Unlike the images to be classified with these algorithms, in this method, the vectors are made up of 180 values. This enables reductions in the number of inputs in convolutional networks and in the number of parameters.
As a possible extension of our work, a different alternative is to consider descriptors other than BOF in order to consider the placement and sequence of each point in the depth matrix. This aims to bypass the projection of points onto the layers.