Article

A Robust and Lightweight Loop Closure Detection Approach for Challenging Environments

School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(7), 322; https://doi.org/10.3390/drones8070322
Submission received: 16 May 2024 / Revised: 10 July 2024 / Accepted: 11 July 2024 / Published: 12 July 2024

Abstract
Loop closure detection is crucial for simultaneous localization and mapping (SLAM), as it can effectively correct accumulated errors. Complex scenarios place high demands on the robustness of loop closure detection, and traditional feature-based methods often fail to meet these challenges. To solve this problem, this paper proposes a robust and efficient deep-learning-based loop closure detection approach. We employ MixVPR to extract global descriptors from keyframes and construct a global descriptor database. For local feature extraction, SuperPoint is utilized. The global descriptor database is then used to find loop frame candidates, and LightGlue is subsequently used to match the local features of the most similar loop frame and the current keyframe. After matching, the relative pose can be computed. Our approach is first evaluated on several public datasets, and the results prove that it is highly robust to complex environments. It is further validated on a real-world dataset collected by a drone, where it achieves accurate performance and shows good robustness in challenging conditions. Additionally, an analysis of time and memory costs proves that our approach maintains accuracy while offering satisfactory real-time performance.

1. Introduction

With the research and development of autonomous driving and drone swarm systems, the application scenarios for unmanned ground vehicles (UGVs) and unmanned aerial vehicles (UAVs) have gradually expanded from the laboratory to complex scenarios in the real world. Visual SLAM technology has been widely applied on UAVs/UGVs to complete a series of complex tasks in the real world. The complexity of real-world scenarios is mainly reflected in factors such as large spatial ranges, dynamic scene changes, and variations in lighting. Loop closure detection plays a crucial role in reducing cumulative errors during the completion of complex tasks [1]. Visual loop closure detection mainly relies on the associations between visual features, but large viewpoint changes, variations in lighting, occlusions, blurriness, and lack of texture pose particularly challenging factors for 2D-to-2D data association [2]. Therefore, such scenarios impose high demands on the robustness of visual SLAM loop closure detection, which is a significant challenge in SLAM [3]. To address the challenges faced by visual SLAM in complex real-world scenarios, we conduct research on robust loop closure detection.
Most existing loop closure detection methods use a traditional visual bag-of-words (BoW) model for place recognition, such as VINS-Fusion [4], ORB-SLAM3 [5], and Kimera [6]. The BoW typically requires a training stage wherein visual features such as SIFT [7], SURF [8], BRIEF [9], and ORB [10] are extracted from training images and then clustered to generate a vocabulary, which is a set of quantized visual features known as visual words [11]. However, traditional feature-based BoW methods often fail in long-term SLAM applications where the scene changes over time. Moreover, the static vocabularies used in pre-trained BoW methods are not well-suited for long-term place recognition because the scenes may differ from the training images [12].
With the rapid development of convolutional neural networks (CNNs), many works have shown robust performance on tasks such as image retrieval and feature matching, especially under extremely challenging conditions [13]. In terms of image retrieval, NetVLAD [14] and DenseVLAD [15] have achieved good results. For image matching, the combination of SuperPoint [16] and SuperGlue [2] has been proven to perform robust feature matching. However, applying deep-learning-based methods to practical SLAM systems still presents challenges. With better and more complex models emerging, ensuring real-time model inference is a crucial problem: if the inference time of a model is too long, it may not be able to meet the real-time requirements of the SLAM system.
To address the problems mentioned above, this paper proposes a robust and lightweight loop closure detection approach for challenging environments that is based on state-of-the-art deep learning methods. Building upon the success of the combination of SuperPoint and SuperGlue in the SLAM field, our approach uses SuperPoint for the feature extraction component to enhance the robustness of the extracted feature points, and a more lightweight and effective method, LightGlue [17], is employed for feature matching. For place recognition, we utilize MixVPR [18], a method based on multi-layer perceptrons (MLPs) that can represent the global features of images more stably. Furthermore, we deploy the models utilized in our approach using the TensorRT framework to guarantee the real-time performance of loop closure detection.
The main contributions of this paper are as follows:
  • A novel deep loop closure detection method is proposed to address challenging scenarios; it incorporates the latest methods for both feature matching and place recognition into the visual SLAM framework. Performance evaluations are conducted on multiple datasets and achieve promising results.
  • The inference of the models is accelerated using TensorRT. Pre-processing and post-processing by the models are optimized based on the concept of parallel programming, ensuring the real-time performance of the loop closure detection approach.
  • Our code is open-source and is available at https://github.com/kajo-kurisu/D_VINS (accessed on 24 April 2024).
The structure of this paper is as follows: In Section 2, we review the existing image matching and loop closure detection methods. In Section 3, we provide a detailed introduction to the proposed approach. Section 4 describes the implementation details of the algorithm. In Section 5, we test the proposed approach on several datasets and compare it with other approaches. Finally, in Section 6, we present the conclusions drawn from our research.

2. Related Works

2.1. Robust Feature Matching

In SLAM, the correspondence between points in images is vital for estimating 3D structures and camera poses. This correspondence is typically estimated through the matching of local features, a process known as data association. The key to this step is to identify the best features that are invariant to orientation, angle, and contrast in images, as this results in higher precision and faster outcomes for data association [19]. In visual SLAM, feature point extraction and matching methods are mainly divided into two categories: traditional methods and deep methods. Traditional methods include algorithms such as SIFT, SURF, and ORB. The SIFT and SURF algorithms cannot meet the real-time requirements of SLAM. The ORB algorithm achieves real-time performance, but its matching accuracy is not as high as that of SIFT, and it is significantly affected by changes in lighting conditions.
Recent studies on deep learning for image matching have demonstrated the advanced capabilities of deep local features. Representative works include SuperPoint [16], R2D2 [20], and DISK [21]. SuperPoint is a local feature extraction method based on a fully convolutional neural network (FCNN) that is self-supervised. On the HPatches [22] dataset, SuperPoint’s performance in homography estimation significantly outperforms LIFT [23] and ORB, and in some cases, it can even surpass SIFT’s performance when using various thresholds of correctness. R2D2 is a method that jointly learns the description and detection of feature points and is capable of finding reliable and repeatable keypoints. DISK is a method based on reinforcement learning that differs from the previously mentioned approaches by providing a very dense set of feature points.
At the same time, the performance of feature matching algorithms has also been significantly improved. Representative works include SuperGlue [2], which is a method for feature matching based on attentional graph neural networks that robustly matches the features extracted from two images. Another work is LoFTR [24], which is a transformer-based [25] method that directly matches two images without a separate feature extraction step (detector-free) and outputs matching pairs. Experiments using SuperGlue have shown that, when combined with SuperPoint, it achieves leading performance in pose estimation tasks in challenging indoor and outdoor scenes. Some works have integrated these methods into SLAM, such as [26,27], which apply them to visual odometry by using SuperPoint as the front-end feature extraction method and SuperGlue for feature matching to achieve feature tracking. SP-LOOP [28] utilizes this combination for loop closure detection, performing local feature matching between the current frame and the loop frame. In [29], the authors apply this combination not only to visual odometry but also to loop closure detection.
Recently, LightGlue [17], which is a matcher that combines attention mechanisms with insight about the matching problem and recent innovations in transformers, has been proposed. It has the ability to stop at earlier layers based on the amount of visual overlap, appearance changes, or discriminative information. Compared to SuperGlue, LightGlue is lighter and more accurate, and it also demonstrates good compatibility with SuperPoint. Experimental results using LightGlue show that the combination of SuperPoint and LightGlue almost reaches the state-of-the-art (SOTA) level in terms of accuracy. Therefore, this combination has great potential for application in SLAM. Inspired by the above methods, our work focuses on designing a combination based on SuperPoint and LightGlue in order to achieve a robust and lightweight feature matching method that is then applied to the field of SLAM.

2.2. Robust Loop Closure Detection

In order to effectively perform SLAM tasks in the real world, the loop closure detection approach must be robust enough to overcome disturbances in the real environment, such as significant changes in viewpoint and illumination, occlusions, blurring, and lack of texture.
Since the introduction of the bag-of-words (BoW) model, many works have used it for loop closure detection. DBoW [11] utilizes traditional handcrafted features (such as SIFT [7], SURF [8], ORB [10], etc.) to train a BoW vocabulary; it then constructs a visual vocabulary tree, which is finally used for loop closure detection. Systems such as VINS-Fusion [4], Kimera [6], ORB-SLAM2 [30], and ORB-SLAM3 [5] all utilize DBoW2. SP-LOOP [28] combines the CNN-based descriptor SuperPoint with BoW, using SuperPoint to train the vocabulary tree and applying it to loop closure detection. However, a main issue with BoW methods is the requirement for a training phase. Depending on the number of descriptors and the clustering technique used, this phase can be quite time-consuming; higher descriptor counts generally result in better performance but require more training time. Additionally, various BoW-based methods do not perform ideally under varying lighting conditions and viewpoints. Moreover, the static vocabulary used in pre-trained BoW methods may not be suitable for long-term place recognition, as the scene can differ from the training images.
There is another method for loop closure detection based on vectors of locally aggregated descriptors (VLADs) [31,32]. With the development of deep learning, many CNN variants based on VLADs have emerged and have been used in various excellent projects. Omni-Swarm [33] uses MobileNetVLAD [34] to extract global descriptors for loop closure detection in drone swarms. PanoNetVLAD [35] shows that the NetVLAD-based method is more effective at loop closure detection in complex scenes than DBoW2. ESA-VLAD [36] integrates a second-order attention module with EfficientNetB0 as the backbone network for global feature extraction and combines it with NetVLAD to form a lighter and more effective network for loop closure detection. Ref. [37] achieves precise hierarchical localization using HF-Net.
However, while methods based on the aggregation of local features can address some issues of viewpoint transformation, they are easily compromised under severe scene changes, such as drastic illumination variations and seasonal changes. Recently, MixVPR [18], which is a new holistic feature aggregation technique based on MLPs, has shown the ability to robustly describe the global features of images. It achieves this through a stack of isotropic blocks referred to as a FeatureMixer, which consists solely of multi-layer perceptrons (MLPs). This method based on MLPs is more effective at focusing on the global features of an image compared to previous methods based on local feature aggregation, which has been demonstrated in MixVPR’s experiments. Moreover, MixVPR is lighter than other approaches, which aligns with the requirements of SLAM systems. Inspired by these approaches, our research aims to create a more robust and lightweight loop closure detection method by combining holistic feature aggregation methods based on MLPs with vector retrieval methods based on k-nearest neighbors (kNNs).

3. Methods

In this section, we first provide an overview of the system, followed by an introduction to each of its components, including feature extraction, keyframe retrieval, feature matching, and relative pose estimation.

3.1. System Overview

The loop closure detection approach we designed is scalable and can be easily applied under the keyframe-based SLAM framework. The pipeline is shown in Figure 1.
When the VIO system receives a new keyframe, we first calculate its global descriptors, local descriptors, and local features. The global descriptor is obtained through the MixVPR network and is further used to construct a global descriptor database for keyframe retrieval. To ensure the accuracy of keyframe retrieval, we use the inner product between global descriptors to evaluate the similarity between keyframes and sort them, selecting the top k frames with the highest similarity as candidate frames. To accelerate the retrieval process of the global descriptor database, we employ the Faiss library [38], which is an efficient library for similarity search and dense vector clustering.
Local features and local descriptors are image features that are extracted and encoded by the SuperPoint network and are used for feature matching after the completion of keyframe retrieval. We use the LightGlue network to match the local features between the current frame and the final loop frame. Then, using this matching information, we employ the Perspective-n-Point (PnP) algorithm [39] to calculate the relative pose between the current frame and the loop closure candidate frame. Finally, this loop information is used as a constraint in the pose graph optimization (PGO) process to generate more accurate pose estimation results.
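To make the data flow concrete, the following Python-style sketch outlines the per-keyframe routine implied by the pipeline. All components (global_net, local_net, matcher, database, solver) are injected placeholders standing in for MixVPR, SuperPoint, LightGlue, the descriptor database, and the PnP solver; the names, signatures, and the minimum-match threshold are illustrative assumptions rather than our actual interfaces.

```python
# Hypothetical per-keyframe routine mirroring the pipeline in Figure 1.
# All components are injected callables; names and signatures are illustrative only.

def process_keyframe(keyframe, global_net, local_net, matcher, database, solver):
    g = global_net(keyframe.image)                 # global descriptor (MixVPR)
    kpts, descs = local_net(keyframe.image)        # local features (SuperPoint)

    loop_frame = database.query(g)                 # top-k similarity search + thresholds
    database.add(g, keyframe)                      # always store the new descriptor

    if loop_frame is None:
        return None                                # no loop closure for this keyframe

    matches = matcher(kpts, descs,                 # local feature matching (LightGlue)
                      loop_frame.kpts, loop_frame.descs)
    if len(matches) < 25:                          # illustrative minimum-match threshold
        return None

    # relative pose from 3D points of the current frame and 2D points of the loop frame
    T_rel = solver(keyframe.points_3d, loop_frame.points_2d, matches)
    return T_rel                                   # used as a loop constraint in PGO
```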

3.2. Feature Extraction

The process of feature extraction is shown in Figure 2. Whenever a new keyframe is received, if the image is in color, it is initially converted to grayscale before being processed by the SuperPoint network.
The image first passes through the encoder, the primary function of which is to map the input image $I \in \mathbb{R}^{H \times W}$ to an intermediate tensor $\mathcal{B} \in \mathbb{R}^{H_c \times W_c \times F}$ with smaller spatial dimensions and larger channel depth through max-pooling layers, where $H_c = H/8$, $W_c = W/8$, and $F > 1$. This intermediate tensor is then used by the interest point decoder to compute the feature points, while the descriptor decoder calculates the local descriptors corresponding to the feature points. Through the inference process of the SuperPoint network, we obtain the keypoints, local descriptors, and corresponding scores.
Additionally, the feature points tracked by the VIO system can also serve as the local features of the image. To integrate these features and ensure consistency with the features output by the original SuperPoint network, we modified the original SuperPoint network by retaining its shared encoder and descriptor decoder components. We input the keyframe images and feature points tracked by VIO into the adjusted network to obtain the SuperPoint descriptors for these tracked feature points. This modification enhances the scalability of our loop closure approach, making it compatible with feature points extracted from various visual front-ends.
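As an illustration of how descriptors for externally tracked keypoints can be obtained from the retained descriptor decoder, the sketch below bilinearly samples the dense descriptor map at given pixel locations. The function name and the exact normalization are assumptions made for this example and may differ from our implementation.

```python
# A minimal sketch of sampling descriptors for VIO-tracked keypoints from
# SuperPoint's dense descriptor map, assuming the map has shape (1, D, H/8, W/8)
# and keypoints are given in pixel coordinates of the H x W image.
import torch
import torch.nn.functional as F

def sample_descriptors(keypoints, desc_map, image_size):
    """keypoints: (N, 2) tensor of (x, y) pixels; desc_map: (1, D, Hc, Wc)."""
    h, w = image_size
    # map pixel coordinates to the [-1, 1] range expected by grid_sample
    grid = keypoints.clone().float()
    grid[:, 0] = grid[:, 0] / (w - 1) * 2 - 1
    grid[:, 1] = grid[:, 1] / (h - 1) * 2 - 1
    grid = grid.view(1, 1, -1, 2)                          # (1, 1, N, 2)
    desc = F.grid_sample(desc_map, grid, mode="bilinear", align_corners=True)
    desc = desc.squeeze(2).squeeze(0).t()                  # (N, D)
    return F.normalize(desc, p=2, dim=1)                   # L2-normalize per keypoint
```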

3.3. Keyframe Retrieval

As previously mentioned, we design our keyframe retrieval module based on the MixVPR network. Specifically, we first use MixVPR to extract global descriptors from the images. An input image $I$ with a format of $C \times H \times W$ first passes through a CNN backbone network, and MixVPR extracts the holistic feature map $F \in \mathbb{R}^{c \times h \times w}$ from the middle layer of this backbone, i.e., $F = \mathrm{CNN}(I)$. Then, this three-dimensional tensor is divided into a set of 2D feature maps of size $h \times w$:

$$F = \{X_i\}, \quad i \in \{1, \dots, c\}$$

Here, $X_i$ corresponds to the $i$-th activation map within $F$, which contains a certain amount of image information. Each 2D feature map $X_i$ is then reshaped into a 1D representation (flattening), resulting in a flattened feature map $F \in \mathbb{R}^{c \times n}$, where $n = h \times w$. Subsequently, these flattened features are input to a cascade of $L$ MLP blocks (Feature Mixer) with identical structures. It is the fully connected structure that gives the network a complete receptive field, enabling it to better focus on the global features of the entire image. After the MLP layers, two mapping operations, depth-wise projection and row-wise projection, are performed to reduce the dimensionality of the result to $(d, r)$. After one more flattening and normalization operation, we finally obtain a $d \times r$-dimensional global descriptor with good global feature representation capabilities. The network structure of MixVPR is shown in Figure 3.
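For clarity, the following PyTorch sketch re-implements the aggregation described above (Feature Mixer blocks followed by depth-wise and row-wise projections) directly from this description rather than from the official MixVPR code; the residual connection, layer sizes, and the default values of $d$, $r$, and the number of layers are assumptions for illustration.

```python
# Illustrative re-implementation of the MLP-based aggregation described in the text.
import torch.nn as nn
import torch.nn.functional as F

class FeatureMixerLayer(nn.Module):
    def __init__(self, n, ratio=1):
        super().__init__()
        self.mix = nn.Sequential(
            nn.LayerNorm(n),
            nn.Linear(n, n * ratio),
            nn.ReLU(),
            nn.Linear(n * ratio, n),
        )

    def forward(self, x):            # x: (B, c, n), mixed along the flattened spatial dim
        return x + self.mix(x)       # residual connection (assumed)

class MixVPRHead(nn.Module):
    def __init__(self, c, h, w, d=4, r=256, num_layers=4):
        super().__init__()
        n = h * w
        self.mixers = nn.Sequential(*[FeatureMixerLayer(n) for _ in range(num_layers)])
        self.depth_proj = nn.Linear(c, d)   # depth-wise projection: c -> d channels
        self.row_proj = nn.Linear(n, r)     # row-wise projection: n -> r elements

    def forward(self, fmap):                # fmap: (B, c, h, w) from the CNN backbone
        x = fmap.flatten(2)                 # (B, c, n)
        x = self.mixers(x)                  # L Feature Mixer blocks
        x = self.depth_proj(x.permute(0, 2, 1)).permute(0, 2, 1)   # (B, d, n)
        x = self.row_proj(x)                # (B, d, r)
        return F.normalize(x.flatten(1), p=2, dim=1)               # (B, d*r) descriptor
```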
However, MixVPR supplies only the descriptor itself and does not directly support loop detection functionality. To fill this gap, we designed a real-time global database based on global descriptors. The process of this part is shown in Figure 4.
In this database, each keyframe $kf_i$ has a corresponding global descriptor $gd_i$, where $i$ represents the index of the keyframe and $n$ is the total number of keyframes. Therefore, our database is represented as:

$$DB = \{gd_i\}, \quad i \in \{1, \dots, n\}$$

Whenever a new keyframe is obtained, we first extract its global descriptor and then integrate it into the database $DB$. Subsequently, we perform a similarity query for this new global descriptor to detect potential loops. To reduce the impact of recent data on query results, we introduce a window threshold $thres$. Thus, the query set $Q$ is defined as:

$$Q = \{gd_i \mid i < n - thres\}$$
This ensures that the query set only contains keyframe descriptors that are far enough from the current frame, thereby improving the accuracy and efficiency of detection.
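A minimal sketch of such a database built on Faiss is given below, assuming the global descriptors are L2-normalized so that the inner product corresponds to cosine similarity. The class name, the default window size, and the strategy of over-searching and then discarding hits inside the recency window are illustrative choices, not the exact implementation.

```python
# Global descriptor database sketch using Faiss inner-product search with a
# recency window `thres` that excludes recent keyframes from queries.
import numpy as np
import faiss

class GlobalDescriptorDB:
    def __init__(self, dim, thres=50):
        self.index = faiss.IndexFlatIP(dim)   # exact inner-product search
        self.thres = thres

    def add(self, gd):
        self.index.add(gd.reshape(1, -1).astype(np.float32))

    def query(self, gd, k=3):
        n = self.index.ntotal
        if n <= self.thres:
            return [], []                     # query set Q is empty
        # search a few extra neighbors, then drop hits inside the recency window
        sims, ids = self.index.search(gd.reshape(1, -1).astype(np.float32),
                                      min(n, k + self.thres))
        keep = [(s, i) for s, i in zip(sims[0], ids[0]) if i < n - self.thres]
        keep = keep[:k]                       # top-k candidates outside the window
        return [s for s, _ in keep], [i for _, i in keep]
```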
The retrieval of loop closure keyframes is achieved through the global descriptor database mentioned earlier. The process of this part is shown in Figure 5. Whenever a new keyframe $KF$ arrives, we select the subset $Q$ of vectors from the database for the similarity search, and we sort these vectors by similarity to identify the top $k$ frames with the highest similarity as loop candidate frames; these are denoted as $\{kf_1, kf_2, \dots, kf_k\}$.
To select the best loop closure candidate frame, we set two similarity thresholds, $th_1$ and $th_2$. If the highest similarity among the top $k$ candidate frames is below $th_1$, we conclude that the new keyframe $KF$ does not have a corresponding loop frame; in this case, we only add its global descriptor to the database and do not perform further loop closure detection operations. If, on the other hand, the highest similarity among the top $k$ candidate frames exceeds $th_1$, we further retain the candidate frames whose similarities exceed $th_2$. The earliest appearing keyframe among them is then selected as the loop closure candidate, as earlier keyframes often have more reliable pose estimates. Through this process, we ultimately determine the best loop closure candidate frame for the current keyframe.
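The two-threshold selection can be summarized by the following short sketch, where the threshold values are placeholders and the inputs are the similarities and keyframe indices of the top-k retrieval results:

```python
# Illustrative selection of the final loop candidate from the top-k retrieval results.
def select_loop_candidate(similarities, frame_ids, th1=0.6, th2=0.5):
    if not similarities or max(similarities) < th1:
        return None                              # no loop for this keyframe
    # keep candidates above th2 and pick the earliest keyframe among them,
    # since earlier keyframes tend to have more reliable pose estimates
    good = [fid for sim, fid in zip(similarities, frame_ids) if sim > th2]
    return min(good) if good else None
```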

3.4. Feature Matching and Relative Pose Estimation

After identifying the best loop closure frame, the next step is to perform local feature matching between the current frame and the loop frame. We design the feature matching and pose computation module of our approach based on LightGlue. Specifically, we first employ LightGlue to process the previously mentioned local features and their descriptors that were extracted by SuperPoint. The procedure for this step is illustrated in Figure 6.
The first half of the LightGlue network is similar to SuperGlue. First, the features and local descriptors of the two images are input into the network. After self-attention and cross-attention processing, correspondence prediction is carried out, which evaluates all pairs of points for matchability and similarity. The matchability score assesses whether a point can be matched with others, while the similarity score assesses the reliability of the matching relationship between points. The pair of points $(i, j)$ that is predicted to be matchable and has the highest similarity score is considered the correct matching pair.
Unlike SuperGlue, LightGlue has an adaptive exit mechanism and a pruning mechanism, which reduce unnecessary computation in the model and speed up inference. Because the attention layers progressively enhance the visual features, if the input image pair is relatively simple, i.e., it has high field-of-view overlap and minimal appearance changes, then the matching predictions in the early layers tend to be reliable and are essentially the same as those in the later layers. At this point, inference can be stopped to save time. The matching prediction is as follows:
$$c_i = \mathrm{Sigmoid}\big(\mathrm{MLP}(x_i)\big) \in [0, 1]$$
Here, $\mathrm{MLP}(\cdot)$ is part of each attention unit, and $x_i^I \in \mathbb{R}^d$ represents the state corresponding to feature $i$ in image $I$. The greater the value of this prediction, the more similar the predictions for feature $i$ are in the earlier and later layers (these features can be either matchable or non-matchable). After each layer is processed, an exit decision is made. When a sufficient number of reliable points are identified, LightGlue exits the inference and outputs the current result. Otherwise, it performs point pruning by discarding points that have high confidence but are marked as non-matchable, and then it continues inference in the next layer.
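The following toy sketch illustrates the idea of the confidence classifier and the exit decision; it is not LightGlue's actual implementation, and the thresholds and network sizes are arbitrary placeholders.

```python
# Toy illustration of per-point exit confidences and the early-exit decision.
import torch
import torch.nn as nn

class ExitClassifier(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, x):                      # x: (N, d) point states after one layer
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # c_i in [0, 1]

def should_exit(confidences, tau=0.95, min_ratio=0.8):
    # stop inference once a sufficient fraction of points is confident
    return (confidences > tau).float().mean().item() >= min_ratio
```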
Once network inference is complete, we use the set $\mathcal{M} = \{(P_i, P_j)\}$ to represent the matching pairs obtained from LightGlue's inference. Within this set, we retain the matching pairs whose scores exceed a threshold $c_l$, but we refrain from setting this threshold too high because we believe that the reprojection error handled by the subsequent RANSAC algorithm [40] is more critical than the precision of the network's output. When the number of matching pairs exceeds a predetermined threshold $N_{loop}$, we use the PnP algorithm to determine the relative transformation $T$ between the two frames. During this computation, the VIO tracking points of the current frame serve as the input 3D points, while the SuperPoint feature points and tracking points of the loop frame serve as the input 2D points. Finally, the relative pose transformation $T$ obtained through the PnP calculation is applied as a loop closure constraint in the PGO to optimize the pose estimation.
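A minimal sketch of this last step using OpenCV's PnP solver with RANSAC is shown below; the reprojection-error threshold, iteration count, and minimum-match count are placeholders rather than the values used in our system.

```python
# Relative pose from matched 3D (current frame) and 2D (loop frame) points via PnP + RANSAC.
import cv2
import numpy as np

def relative_pose_pnp(pts3d, pts2d, K, dist_coeffs=None, min_matches=25):
    if len(pts3d) < min_matches:
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts3d, dtype=np.float64),
        np.asarray(pts2d, dtype=np.float64),
        K, dist_coeffs,
        reprojectionError=8.0, iterationsCount=100)
    if not ok or inliers is None or len(inliers) < min_matches:
        return None
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> rotation matrix
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = tvec.ravel()
    return T                            # homogeneous transform used as a PGO constraint
```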

4. Implementation Details

We replaced the loop closure detection module VINS-Loop in VINS-Fusion with the proposed module and deployed it using C++ and ROS. For the parts of the algorithm related to neural networks, the original SuperPoint, LightGlue, and MixVPR models were used in Python (v3.8.18). To facilitate deployment in C++, the C++ interface of Torch [41] offers a convenient conversion path. However, in practical engineering scenarios involving GPUs, inference based on the Torch framework is often slow, which can severely affect algorithm performance. Therefore, we chose TensorRT as our inference framework. Specifically, we first converted the PyTorch models of the mentioned networks into ONNX models, simplified them, and then converted them into engines suitable for TensorRT inference. Considering the balance between precision and time, we chose FP16 as the quantization precision. Additionally, we optimized the pre-processing and post-processing of the models based on the concept of parallel programming to make them more efficient. The other parts of the algorithm are executed on the CPU. We analyze the time and memory costs in Section 5.
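The export path can be sketched as follows; the model, input tensor, opset version, and file names are placeholders, and the same procedure is applied to each network with its own input format.

```python
# Hedged sketch of the PyTorch -> ONNX -> (simplified) ONNX step of the deployment path.
import torch
import onnx
from onnxsim import simplify

def export_onnx(model, dummy_input, path="model.onnx"):
    model.eval()
    torch.onnx.export(model, dummy_input, path,
                      input_names=["input"], output_names=["output"],
                      opset_version=16)
    simplified, ok = simplify(onnx.load(path))   # remove redundant nodes
    assert ok, "ONNX simplification failed"
    onnx.save(simplified, path)

# The FP16 TensorRT engine can then be built with the trtexec tool, e.g.:
#   trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```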

5. Experiments

To demonstrate the robustness and accuracy of our approach, we deployed it on the VINS-Fusion framework and conducted experimental validations on both static and dynamic scene datasets. We also analyzed the time cost to prove the real-time performance of our approach. In our experiments, we used the absolute trajectory error (ATE) as the evaluation metric. The ATE evaluates the deviation between the pose estimated by the approach and the groundtruth at the same timestamp, which intuitively reflects the performance of the algorithm throughout the pose estimation process. However, the poses estimated by VI-SLAM and the groundtruth are usually not in the same coordinate system, so we align the two trajectories first. This alignment computes the transformation matrix $S \in \mathrm{SE}(3)$ that best fits the two trajectories by considering the temporal correspondence. The estimated trajectory can then be converted into the coordinate system of the true trajectory through this transformation matrix. Since the VI-SLAM systems evaluated in this paper have absolute scale information, the uncertainty of the scale does not need to be considered. Therefore, the ATE for the $i$-th frame can be expressed as follows:
$$F_i := Q_i^{-1} S P_i$$

where $Q_1, \dots, Q_n \in \mathrm{SE}(3)$ represent the groundtruth poses, with the subscript indicating the timestamp (frame $n$), and $P_1, \dots, P_n \in \mathrm{SE}(3)$ represent the poses estimated by the algorithm. When comparing against the datasets, the timestamps of the estimated poses and true poses are aligned. To assess the overall performance of the algorithm as comprehensively as possible, the root mean squared error (RMSE) [42] is usually used to calculate the ATE:

$$\mathrm{RMSE}(F_{1:n}, \Delta) := \left( \frac{1}{m} \sum_{i=1}^{m} \left\| \mathrm{trans}(F_i) \right\|^2 \right)^{\frac{1}{2}}$$
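As a small worked example, the RMSE above can be computed with a few lines of NumPy, assuming the estimated trajectory has already been aligned (i.e., $S$ has been applied) and that both trajectories are given as timestamp-matched 4 × 4 homogeneous poses:

```python
# ATE RMSE over the translational part of F_i = Q_i^{-1} * P_i (alignment pre-applied).
import numpy as np

def ate_rmse(gt_poses, est_poses):
    """gt_poses, est_poses: (m, 4, 4) arrays of SE(3) poses with matching timestamps."""
    errors = []
    for Q, P in zip(gt_poses, est_poses):
        F = np.linalg.inv(Q) @ P                  # per-frame pose error F_i
        errors.append(np.linalg.norm(F[:3, 3]))   # translational part of F_i
    return np.sqrt(np.mean(np.square(errors)))
```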
In our experiments, we used evo as our evaluation tool. Evo is an open-source evaluation toolkit that provides tools for measuring ATE, relative pose error (RPE), and other metrics, making it easy to use.

5.1. EuRoC

The EuRoC dataset [43] is a visual-inertial dataset collected by drones and consists mainly of static scenes. The dataset was collected in two kinds of scenarios: a narrow room and a factory. Our approach is experimentally validated on this dataset and compared with VINS-Loop and SP-LOOP, because the front-ends of these works are consistent with our algorithm. The experimental results are presented in Table 1.
The experiments demonstrate that our approach is competitive in the more challenging sequences. In the six sequences of V1 and V2, which are narrow indoor environments, our approach has a clear advantage in the relatively difficult V1_03 and V2_03 sequences. In V1_01 and V2_02, the error of our approach is only 0.002 m higher than the smallest one. In the open factory scenarios, our algorithm outperforms both VINS-Loop and SP-LOOP in all sequences except MH_01. Figure 7 shows some matching examples from the EuRoC dataset, demonstrating that our approach can match stably even when faced with large viewpoint changes.

5.2. KAIST

The KAIST urban dataset [44] was collected from highways and urban roads in South Korean cities using a car driving under normal conditions and provides data from various sensors. The data used for the experiments in this section consist of 10 Hz stereo images (with a resolution of 1280 × 560) and 100 Hz IMU data. This dataset is highly challenging for loop closure detection due to its complex environments, including high speeds, long routes, and prolonged durations. We conducted real-world tests of the proposed approach on this dataset and compared it with other approaches. The route diagrams for the sequences used in the experiments are illustrated in Figure 8.
The accuracy evaluation of these sequences from the KAIST dataset is presented in Table 2. We conducted multiple experiments on the utilized approaches and took the average values to obtain the experimental result.
It can be seen that our approach also demonstrates good performance on the KAIST urban dataset. In sequence 26, where loop closure scenarios are infrequent, both VINS-Loop and SP-LOOP fail to detect loops, while our approach can still reliably identify the same place and accurately execute the loop closure. The complexity of the scenes in sequences 27–28 increases, yet our approach still significantly outperforms VINS-Loop and SP-LOOP. Sequences 38–39 are the most challenging in the KAIST visual-inertial dataset; while our approach has lower precision than VINS-Loop and SP-LOOP on sequence 38, it still demonstrates good performance on sequence 39. In terms of the average error, our proposed approach is much lower than VINS-Loop and SP-LOOP, proving its robustness in urban scenes with numerous interfering dynamic objects. Figure 9 shows some loop closure scenarios from the KAIST dataset using our approach.

5.3. Extremely Challenging Test

To further validate the performance of our approach in extreme scenarios, we conducted multi-session experiments on the 4Seasons dataset [45]. This dataset was collected by a vehicle and contains sequences with extreme changes, such as seasonal variations and changes between morning and afternoon within the same scene, posing high demands on the robustness of loop detection algorithms. The three sequences we tested were recorded under spring sunny, summer sunny, and winter snowy conditions, respectively; the significant variations between them make the test highly challenging. The results of our multi-session experiments on the 4Seasons dataset are shown in Table 3.
It can be seen that our approach maintains accuracy under extreme stress tests, demonstrating sufficient robustness. Figure 10 shows some examples of loop detection for the extremely challenging test.

5.4. Real-World Experiments

Apart from public datasets, we conducted experiments in real-world outdoor scenarios. The platform we employed was a self-designed drone, as illustrated in Figure 11. The drone was equipped with an Intel Realsense D455 camera (Intel, Santa Clara, CA, USA) and an NVIDIA Jetson Orin Nano (NVIDIA, Santa Clara, CA, USA). The experimental scenarios were divided into simple and difficult sequences. In the simple sequence, the drone flew through the gaps between low-rise buildings with a limited field-of-view depth for a short route distance and a duration of approximately two minutes. In the difficult sequence, the drone navigated along a road on the spacious grounds in front of a library with a greater field-of-view depth, for a longer route distance, and for a duration of about 12 min. The overall trajectory diagrams for the simple and difficult sequences are shown in Figure 12.
In the experiments, we aligned the starting point of the drone’s takeoff with the ending point after landing. We then evaluated the overall performance of the approach based on the starting and ending positions of the trajectory estimated by the loop closure approach. We conducted experiments using VINS-Loop, SP-LOOP, and our approach and compared their performance. The specific experimental data are presented in Table 4.
In the outdoor simple sequence, due to the scarcity of loop closure scenes, VINS-Loop fails to find any loops. However, both SP-LOOP and our approach are capable of effectively identifying loop closures and obtaining accurate results. In this sequence, the precision of SP-LOOP is comparable to that of our approach, with our approach only being 0.007 m more precise than SP-LOOP. In the outdoor difficult sequence, where the trajectory length and duration significantly increase, VINS-Loop introduces considerable errors, and SP-LOOP is only able to find some accurate loop closures, while our approach handles the challenges of the difficult sequence well and performs accurate loop closure detection.
Figure 13 shows the performance of the three algorithms used in the simple sequence. Figure 13a shows that loop closure cannot be found using VINS-Loop. In Figure 13b, SP-LOOP is almost able to return to the starting point, with some deviation in the vertical direction. In Figure 13c, our approach can also return to the starting point, with some deviation in the forward direction.
Figure 14 shows the performances for the difficult sequence. In Figure 14a, the trajectory obtained using VINS-Loop shows significant drift from the starting point. In Figure 14b, the trajectory obtained using SP-LOOP shows significant improvement compared to VINS-Loop but still exhibits some drift from the starting point. In Figure 14c, our approach, compared to the previous two algorithms, is capable of returning to the starting point more accurately, demonstrating the best performance.
Figure 15 presents some loop closure examples from our approach, showing that it can stably detect accurate loop closures in real-world scenarios, especially in challenging environments.

5.5. Time and Memory Costs

We used two platforms: a desktop and an embedded system. The desktop is equipped with an RTX 2070 SUPER GPU and an Intel i7-8700 CPU, while the embedded system is an NVIDIA Jetson Orin Nano 8 GB. On the desktop, the average inference time of the SuperPoint network on a 752 × 480 input image is about 6 ms, the average matching time of LightGlue is about 4 ms, and the global descriptor computation of MixVPR takes about 1.5 ms, with a GPU memory usage of about 422 MB. On the CPU, to keep the algorithm lightweight, we use Faiss as the vector retrieval tool, with an average retrieval time of about 2 ms per global vector query. Overall, the algorithm proposed in this paper balances high-precision computation with low time consumption. The time cost and GPU memory cost are shown in Table 5 and Table 6.
On the NVIDIA Jetson Orin Nano 8GB platform, our approach can also ensure real-time performance. The primary factor affecting real-time performance on the Orin Nano is feature extraction, as we use SuperPoint to extract 512 feature points in our experiments. Despite this, our approach can still run at a frequency greater than 10 Hz for both the EuRoC and 4Seasons datasets. For the KAIST dataset, due to the higher image resolution, our approach operates at 7 Hz. However, since the image frame rate is only 10 fps, it still meets the real-time requirement. The time cost and GPU memory cost are shown in Table 7 and Table 8.

6. Discussion

This paper proposes a robust and lightweight loop closure detection approach suitable for challenging real-world scenarios with high dynamics and large spatial scales. The approach employs SuperPoint for feature extraction from keyframes, performs loop keyframe retrieval based on MixVPR, uses LightGlue for feature matching, and finally computes the relative pose using the PnP method. Moreover, to ensure real-time performance, we deploy the networks with TensorRT, enabling faster inference on the GPU.
We have conducted experimental validations on multiple datasets and compared our algorithm with others. We first tested on the EuRoC dataset, demonstrating better performance than VINS-Loop and SP-LOOP for general indoor scenes. Subsequently, we tested on the KAIST urban dataset, showing excellent performance for urban scenes. Moreover, we conducted extreme stress testing on the 4Seasons dataset, proving the algorithm’s robustness even under extreme seasonal changes. Experiments in the real world further confirm the robust performance of our approach. Finally, we performed a cost analysis, demonstrating that our approach is very lightweight and can ensure real-time performance on modern CPUs and GPUs.
Two feasible directions can further enhance the real-time performance of our approach and will be pursued in our future research. In scenarios with high precision requirements, lossless acceleration optimization can be carried out. In scenarios with more relaxed precision requirements, the thresholds of our approach can be adjusted dynamically according to the platform's computational power and the precision requirements in order to strike a balance between performance and accuracy.

Author Contributions

Conceptualization, Y.S. (Yuan Shi) and R.L.; methodology, Y.S. (Yuan Shi) and S.L.; software, Y.S. (Yuan Shi); validation, Y.S. (Yuan Shi), R.L. and S.L.; formal analysis, Y.S. (Yuan Shi); investigation, Y.S. (Yuan Shi); resources, R.L. and Y.S. (Yingjing Shi); data curation, Y.S.; writing—original draft preparation, Y.S. (Yuan Shi); writing—review and editing, Y.S. (Yuan Shi), S.L. and R.L.; visualization, Y.S. (Yuan Shi); supervision, R.L. and Y.S. (Yingjing Shi); project administration, R.L. and Y.S. (Yingjing Shi); funding acquisition, R.L. and Y.S. (Yingjing Shi). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under grant No. 61973055 and in part by the Natural Science Foundation of Sichuan Province of China under grant No. 2023NSFSC0511.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ding, C.; Ren, H.; Guo, Z.; Bi, M.; Man, C.; Wang, T.; Li, S.; Luo, S.; Zhang, R.; Yu, H. TT-LCD: Tensorized-Transformer based Loop Closure Detection for Robotic Visual SLAM on Edge. In Proceedings of the IEEE 2023 International Conference on Advanced Robotics and Mechatronics (ICARM), Sanya, China, 8–10 July 2023; pp. 166–172. [Google Scholar]
  2. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
  3. Samadzadeh, A.; Nickabadi, A. Srvio: Super robust visual inertial odometry for dynamic environments and challenging loop-closure conditions. IEEE Trans. Robot. 2023, 39, 2878–2891. [Google Scholar] [CrossRef]
  4. Qin, T.; Cao, S.; Pan, J.; Shen, S. A General Optimization-based Framework for Global Pose Estimation with Multiple Sensors. arXiv 2019, arXiv:1901.03642. [Google Scholar]
  5. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  6. Rosinol, A.; Abate, M.; Chang, Y.; Carlone, L. Kimera: An open-source library for real-time metric-semantic localization and mapping. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 1689–1696. [Google Scholar]
  7. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  8. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  9. Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. Brief: Binary robust independent elementary features. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part IV 11. Springer: Berlin/Heidelberg, Germany, 2010; pp. 778–792. [Google Scholar]
  10. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  11. Galvez-López, D.; Tardos, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
  12. Singh, G.; Wu, M.; Lam, S.K.; Minh, D.V. Hierarchical loop closure detection for long-term visual slam with semantic-geometric descriptors. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 2909–2916. [Google Scholar]
  13. Wang, X.; Christie, M.; Marchand, E. Binary graph descriptor for robust relocalization on heterogeneous data. IEEE Robot. Autom. Lett. 2022, 7, 2008–2015. [Google Scholar] [CrossRef]
  14. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar]
  15. Torii, A.; Arandjelovic, R.; Sivic, J.; Okutomi, M.; Pajdla, T. 24/7 place recognition by view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1808–1817. [Google Scholar]
  16. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  17. Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 17627–17638. [Google Scholar]
  18. Ali-Bey, A.; Chaib-Draa, B.; Giguere, P. Mixvpr: Feature mixing for visual place recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 2998–3007. [Google Scholar]
  19. Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
  20. Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2D2: Reliable and Repeatable Detector and Descriptor. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  21. Tyszkiewicz, M.; Fua, P.; Trulls, E. DISK: Learning local features with policy gradient. Adv. Neural Inf. Process. Syst. 2020, 33, 14254–14265. [Google Scholar]
  22. Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5173–5182. [Google Scholar]
  23. Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. Lift: Learned invariant feature transform. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 467–483. [Google Scholar]
  24. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  26. Fujimoto, S.; Matsunaga, N. Deep Feature-based RGB-D Odometry using SuperPoint and SuperGlue. Procedia Comput. Sci. 2023, 227, 1127–1134. [Google Scholar] [CrossRef]
  27. Rao, S. SuperVO: A Monocular Visual Odometry based on Learned Feature Matching with GNN. In Proceedings of the 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 15–17 January 2021; pp. 18–26. [Google Scholar]
  28. Wang, Y.; Xu, B.; Fan, W.; Xiang, C. A robust and efficient loop closure detection approach for hybrid ground/aerial vehicles. Drones 2023, 7, 135. [Google Scholar] [CrossRef]
  29. Zhu, B.; Yu, A.; Hou, B.; Li, G.; Zhang, Y. A Novel Visual SLAM Based on Multiple Deep Neural Networks. Appl. Sci. 2023, 13, 9630. [Google Scholar] [CrossRef]
  30. Mur-Artal, R.; Tardós, J.D. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  31. Jégou, H.; Perronnin, F.; Douze, M.; Sánchez, J.; Pérez, P.; Schmid, C. Aggregating local image descriptors into compact codes. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 1704–1716. [Google Scholar] [CrossRef] [PubMed]
  32. Arandjelovic, R.; Zisserman, A. All about VLAD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1578–1585. [Google Scholar]
  33. Xu, H.; Zhang, Y.; Zhou, B.; Wang, L.; Yao, X.; Meng, G.; Shen, S. Omni-swarm: A decentralized omnidirectional visual–inertial–uwb state estimation system for aerial swarms. IEEE Trans. Robot. 2022, 38, 3374–3394. [Google Scholar] [CrossRef]
  34. Sarlin, P.E.; Debraine, F.; Dymczyk, M.; Siegwart, R.; Cadena, C. Leveraging deep visual descriptors for hierarchical efficient localization. In Proceedings of the Conference on Robot Learning PMLR, Zürich, Switzerland, 29–31 October 2018; pp. 456–465. [Google Scholar]
  35. Shin, S.; Kim, Y.; Yu, B.; Lee, E.M.; Seo, D.U.; Myung, H. PanoNetVLAD: Visual Loop Closure Detection in Continuous Space Represented with Panoramic View Using Multiple Cameras. In Proceedings of the IEEE 2023 23rd International Conference on Control, Automation and Systems (ICCAS), Yeosu, Republic of Korea, 17–20 October 2023; pp. 172–177. [Google Scholar]
  36. Xu, Y.; Huang, J.; Wang, J.; Wang, Y.; Qin, H.; Nan, K. ESA-VLAD: A lightweight network based on second-order attention and NetVLAD for loop closure detection. IEEE Robot. Autom. Lett. 2021, 6, 6545–6552. [Google Scholar] [CrossRef]
  37. Sarlin, P.E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12716–12725. [Google Scholar]
  38. Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2024, arXiv:cs.LG/2401.08281. [Google Scholar]
  39. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
  40. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  41. Collobert, R.; Bengio, S.; Mariéthoz, J. Torch: A Modular Machine Learning Software Library; IDIAP: Martigny, Switzerland, 2002. [Google Scholar]
  42. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
  43. Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
  44. Jeong, J.; Cho, Y.; Shin, Y.S.; Roh, H.; Kim, A. Complex urban dataset with multi-level sensors from highly diverse urban environments. Int. J. Robot. Res. 2019, 38, 642–657. [Google Scholar] [CrossRef]
  45. Wenzel, P.; Wang, R.; Yang, N.; Cheng, Q.; Khan, Q.; von Stumberg, L.; Zeller, N.; Cremers, D. 4Seasons: A cross-season dataset for multi-weather SLAM in autonomous driving. In Proceedings of the Pattern Recognition: 42nd DAGM German Conference, DAGM GCPR 2020, Tübingen, Germany, 28 September–1 October 2020; Proceedings 42. Springer: Berlin/Heidelberg, Germany, 2021; pp. 404–417. [Google Scholar]
Figure 1. The pipeline of our loop closure detection approach.
Figure 2. The pipeline of feature extraction. The red points and vectors represent the feature points and descriptors tracked by the VIO originally, while the blue ones are extracted by SuperPoint.
Figure 3. The overview of the MixVPR network. MixVPR first extracts the holistic feature map from the intermediate layer of the backbone network and, after flattening, inputs it into L Feature Mixers. Then, through projection and flattening operations, it projects the output to a compact representation space, resulting in a low-dimensional descriptor.
Figure 4. The process of building a global descriptor database $DB$ and query database $Q$ using MixVPR's global descriptors.
Figure 5. The process of loop frame retrieval. In the diagram, the darker the grid color of the vector, the higher the similarity of the corresponding vectors in the query database.
Figure 6. The process of feature matching and relative pose estimation: $p_i$ represents the feature points of image $i$, and $d_i$ represents the local descriptors of image $i$.
Figure 7. Matching examples from EuRoC dataset. (a) Match pair before and after takeoff. (b) Match pair with large viewpoint changes.
Figure 8. Sequences from KAIST urban dataset used in this section. (a) Entering the same road only in the last segment of the route with a distance of 4.0 km. (b) Entering the same road twice, entering the opposite direction road once, and the total distance is 5.4 km. (c) Multiple approaches to the same road, multiple approaches to the opposite road, very large scenario, and 11.74 km. (d) The scenario is basically the same as scene 28, but the road conditions and the scene are more complex than scene 28. (e) The scenario is partly the same as scenarios 28 and 38, but the scene has the highest complexity.
Figure 9. Matching examples from KAIST dataset. (a) Loop closure for long-term tasks. (b) Loop closure in the presence of multiple object occlusions. (c) Loop closure at long distances. (d) Loop closure under strong light conditions.
Figure 10. Matching examples in 4Seasons dataset. (a) Loop closure between spring and summer. (b,c) Loop closure between spring and winter.
Figure 11. Self-designed drone equipped with an Intel Realsense D455 camera and NVIDIA Orin Nano platform.
Figure 12. The overall trajectory diagrams for the simple and difficult sequences. The green trajectory represents the odometry, the red trajectory represents the loop closure, and the blue stars indicate the starting and ending points. (a) The route of a simple sequence. (b) The route of a difficult sequence.
Figure 13. Experimental result of the simple sequence. The green line represents the odometry trajectory, the red line represents the loop closure trajectory, the blue stars indicate the starting points, and the white stars mark the endpoints of the trajectories. (a) VINS-Loop. (b) SP-LOOP. (c) Ours.
Figure 14. Experimental results for the difficult sequence. The red line represents the loop closure trajectory, the blue stars indicate the starting points, and the white stars mark the endpoints of the trajectories. (a) VINS-Loop. (b) SP-LOOP. (c) Ours.
Figure 15. Matching examples in real-world experiment. (a) Loop closure at long distances. (b) Loop closure between the starting and ending positions for the simple sequence. (c) Loop closure for long-term tasks. (d) Loop closure between the starting and ending positions for the difficult sequence.
Table 1. Comparison with other approaches in the EuRoC dataset using RMSE (m). The best results are highlighted in bold.

| Sequence | VINS-VIO | VINS-Loop | SP-LOOP | Ours |
|---|---|---|---|---|
| V1_01_easy | 0.098 | 0.048 | **0.042** | 0.044 |
| V1_02_medium | 0.098 | 0.055 | **0.034** | 0.050 |
| V1_03_difficult | 0.103 | 0.103 | 0.082 | **0.057** |
| V2_01_easy | 0.127 | 0.053 | **0.038** | 0.059 |
| V2_02_medium | 0.123 | 0.088 | **0.054** | 0.056 |
| V2_03_difficult | 0.345 | 0.085 | 0.10 | **0.075** |
| MH_01_easy | 0.160 | **0.049** | 0.070 | 0.058 |
| MH_02_easy | 0.177 | 0.064 | 0.044 | **0.033** |
| MH_03_medium | 0.315 | 0.065 | 0.068 | **0.057** |
| MH_04_difficult | 0.319 | 0.108 | 0.10 | **0.079** |
| MH_05_difficult | 0.177 | 0.095 | 0.09 | **0.081** |
| Average | 0.186 | 0.074 | 0.066 | **0.059** |
Table 2. Comparison with other approaches in the KAIST urban dataset using RMSE (m). The best results are highlighted in bold.

| Sequence | VINS-VIO | VINS-Loop | SP-LOOP | Ours |
|---|---|---|---|---|
| urban26 | 14.126 | 11.663 | 14.122 | **5.445** |
| urban27 | 21.359 | 17.655 | 21.204 | **7.256** |
| urban28 | 12.024 | 6.209 | 11.432 | **5.217** |
| urban38 | 8.879 | 7.833 | **7.356** | 8.119 |
| urban39 | 12.831 | 5.772 | 7.310 | **5.623** |
| Average | 13.844 | 9.826 | 12.285 | **6.332** |
Table 3. Comparison with other approaches for the 4Seasons dataset using RMSE (m). The best results are highlighted in bold.

| Sequence | VINS-VIO | VINS-Loop | Ours |
|---|---|---|---|
| 1_spring_sunny | 88.413 | 49.764 | **20.735** |
| 3_summer_sunny | 73.317 | 34.887 | **16.968** |
| 5_winter_snowy | 57.252 | **16.913** | 19.681 |
| Average | 72.994 | 33.855 | **19.128** |
Table 4. Comparison with other approaches in the real-world experiment using ATE (m). The best results are highlighted in bold.

| Sequence | VINS-VIO | VINS-Loop | SP-LOOP | Ours |
|---|---|---|---|---|
| outdoor_easy | 1.337 | 1.266 | 0.122 | **0.115** |
| outdoor_difficult | 41.745 | 29.807 | 2.681 | **0.473** |
Table 5. The runtime (ms) for each part of our approach on the desktop.

| Dataset | Global Feature Extraction | Local Feature Extraction | Keyframe Retrieval | Feature Matching |
|---|---|---|---|---|
| EuRoC | 1.5 | 6.1 | 0.8 | 4.1 |
| KAIST | 2.3 | 10.3 | 1.7 | 3.8 |
| 4Seasons | 2 | 5.4 | 2.2 | 4.1 |
Table 6. The time cost (ms) and GPU memory cost (MB) of our approach on the desktop.

| Total Cost | EuRoC | KAIST | 4Seasons |
|---|---|---|---|
| Time cost (ms) | 12.5 | 18.1 | 13.7 |
| GPU memory cost (MB) | 422 | 600 | 400 |
Table 7. The runtime (ms) for each part of our approach on the Orin Nano.

| Dataset | Global Feature Extraction | Local Feature Extraction | Keyframe Retrieval | Feature Matching |
|---|---|---|---|---|
| EuRoC | 8.48 | 53 | 0.7 | 28.3 |
| KAIST | 10.2 | 98.48 | 1.8 | 33.4 |
| 4Seasons | 8.71 | 57.88 | 2.3 | 26.7 |
Table 8. The time cost (ms) and GPU memory cost (MB) of our approach on the Orin Nano.

| Total Cost | EuRoC | KAIST | 4Seasons |
|---|---|---|---|
| Time cost (ms) | 90.48 | 143.88 | 95.59 |
| GPU memory cost (MB) | 753 | 824 | 734 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
