This section provides a detailed description of the overall process of the loop closure-detection method proposed in this paper. The entire process is described in the following subsections.
2.1. Feature Extraction by the Backbone Network
The Siamese network [31] is a similarity-metric-based learning method that has achieved significant success in fields such as scene recognition and audio processing. The network has a simple structure, making it easy to train, and consists of two identical backbone networks as sub-networks. It takes two image sequences as input and outputs pairs of similar images based on a similarity-detection method applied to the features extracted by the backbone networks. The network architecture is illustrated in Figure 2.
Here, $X_1$ and $X_2$ are the inputs of the network, $G_W$ represents the two identical convolutional neural networks (i.e., the sub-networks), $W$ denotes the network weights shared across the two sub-networks, and $E_W$ is the similarity score of the outputs of the two networks. This network is a feedforward network. In image similarity detection, each input consists of two different images and a label. The label is encoded in binary form: a label of 0 indicates that the two images are dissimilar, and a label of 1 indicates that they are similar. The input images undergo feature extraction by the convolutional neural networks, and the extracted features are flattened into feature vectors. The distance between these two feature vectors is then calculated and fed into the loss function, through which the shared network weights are optimized. Throughout this process, the features of the two images are mapped to a new subspace in which positive samples lie close to each other and negative samples lie far apart, completing the optimal matching.
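To make this concrete, the following is a minimal PyTorch-style sketch of a weight-shared Siamese pair with a contrastive loss under the labeling convention above (1 = similar, 0 = dissimilar). The class and function names, and the toy backbone, are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    """Both inputs pass through the SAME backbone, so the weights W are shared."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x1, x2):
        # G_W(x1), G_W(x2): one module, two forward passes => shared weights
        return self.backbone(x1), self.backbone(x2)

def contrastive_loss(f1, f2, label, margin=1.0):
    """label = 1 for similar pairs, 0 for dissimilar, as described in the text."""
    d = F.pairwise_distance(f1, f2)                        # Euclidean distance of embeddings
    similar = label * d.pow(2)                             # pull similar pairs together
    dissimilar = (1 - label) * F.relu(margin - d).pow(2)   # push dissimilar pairs apart
    return (similar + dissimilar).mean()

# Usage sketch with a stand-in backbone:
net = SiameseNet(nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 128)))
f1, f2 = net(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224))
loss = contrastive_loss(f1, f2, torch.tensor([1.0, 0.0, 1.0, 0.0]))
```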
For image-classification tasks, the number of neurons in the output layer of a neural network is typically equal to the number of categories. However, the research presented in this paper requires only similarity detection, not classification. The output of the fully connected layer can therefore be fed directly into the loss function for similarity calculation. Generally, the Euclidean distance or cosine distance is used, and there is no restriction on the dimensionality of the fully connected layer's output. Consequently, there are many choices for the backbone network of the Siamese neural network in this project, including classical models such as VGG16, AlexNet, GoogLeNet, ResNet, and Inception-v4. Canziani et al. [32] conducted Top-1 accuracy tests on some commonly used CNN models in practical applications, and the test results are shown in Figure 3.
From the graph, it can be observed that, among the various CNN architectures, ResNet-34, ResNet-101, and ResNet-152 outperform some earlier networks. The earlier CNN models improved performance simply by increasing the width and depth of the network; the VGG in Figure 3, for example, was designed by applying this approach to AlexNet. However, merely increasing the number of layers does not necessarily yield better performance. Once the number of layers reaches a certain point, the network's performance tends to saturate, and adding further layers can lead to issues such as gradient explosion and degradation, thereby reducing the model's performance. This phenomenon is not caused by overfitting; rather, it is due to the increase in computational cost and parameter count as the network grows deeper, which makes the model challenging to train. By contrast, deeper networks can capture more information from images, resulting in relatively richer features. Especially in complex dynamic environments, the network should differentiate between different scenes as much as possible while still recognizing the same scene when it contains moving objects and occlusions.
ResNet [33] is a convolutional neural network with a residual structure, proposed by Kaiming He and his colleagues at Microsoft Research and based on VGG19. The network was originally designed to address the gradient explosion and degradation issues caused by the continual increase in the number of intermediate layers (convolutional, pooling, and subsampling layers) in traditional convolutional neural networks. The residual structure is inspired by the shortcut connections in highway networks, introducing a "skip connection" to form a learning structure, as illustrated in Figure 4. The core idea is to stack new layers onto a shallow CNN to create a deep CNN. For a newly added layer, assuming the input is $x$ and the image features extracted from $x$ in the current layer are denoted as $F(x)$, the output of this layer is, theoretically, as follows:
$$H(x) = F(x) + x \quad (1)$$
Here, $H(x)$ represents the output features, and the residual corresponds to the difference between the output and the input, i.e., $F(x) = H(x) - x$. Now consider the extreme case in which the new layer has not learned any features; the residual should be zero in this scenario. Consequently, the layer would pass the original features through the skip connection unchanged, making the new layer essentially an "identity mapping" relative to the old layer. In practice, however, the residual is generally small but not exactly zero. This property makes training the residual easier than training by traditional methods.
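As an illustration of Equation (1), here is a minimal PyTorch sketch of a basic two-convolution residual block; ResNet152 actually uses the three-convolution bottleneck variant described below, but the skip-connection principle is identical:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x: the stacked layers only learn the residual F(x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(residual + x)  # skip connection adds the identity term x
```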
For ResNets with different numbers of layers, the generally adopted residual modules are the two types illustrated in Figure 4. The left-side residual module is utilized for shallow ResNet architectures such as ResNet18 and ResNet34. In contrast, the right-side (bottleneck) residual module is employed for deeper ResNet architectures such as ResNet50, ResNet101, and ResNet152, taking the dimensionality-reduction requirement into consideration. Given the remarkable performance of ResNet, this paper employs ResNet152 as the backbone sub-network for feature extraction, constructing a Siamese network architecture. The network structure can be broadly categorized into the image input layer, pooling layer, convolutional layers, and feature output layer, as depicted in Figure 5.
The dashed-line portion represents the intermediate convolutional layers, where the parameters denote the number of channels, width, and height of each output feature map. The input section consists of a convolutional layer with a kernel size of 7 × 7 and a stride of 2, followed by a pooling layer with a kernel size of 3 × 3 and a stride of 2, used primarily for dimensionality reduction of the input image. The middle section comprises layers 1 to 4, where convolutional layers of the same color form a layer. Layer 1 consists of 3 residual modules, layer 2 of 8, layer 3 of 36, and layer 4 of 3. Each residual module is composed of two convolutional layers with a kernel size of 1 × 1 and a stride of 1 and one convolutional layer with a kernel size of 3 × 3 and a stride of 1. The residual modules are interconnected through skip connections. Ultimately, given an input image of size 224 × 224 with 3 channels, layer 1 produces feature maps of size 56 × 56; the output size is halved in each subsequent layer, resulting in a final feature map of size 7 × 7 × 2048. Because max-pooling layers significantly reduce the precision of image features, the top-level output is unsuitable for similarity detection in dynamic scenes with rich features; moreover, directly using feature maps of such large dimensionality is computationally inefficient. The top-level output also contains many abstract, invariant details, leading to interference from redundant information during similarity detection in dynamic environments.
Therefore, in our work, we focus on the feature maps output by each layer of the ResNet rather than directly processing the top-layer output. Considering the output for a single image, let $l$ denote the selected convolutional layer. The $j$-th feature map output by this layer can be denoted as $M_l^j$:
$$M_l^j \in \mathbb{R}^{W \times H}, \quad 1 \le l \le L \quad (2)$$
Here, $W$ and $H$ represent the width and height of the feature map, respectively, and $L$ is the total number of layers in the network. Each feature map $M_l^j$ is flattened into a one-dimensional vector to construct the image feature descriptor:
$$v_l^j = \mathrm{flatten}\left(M_l^j\right) \in \mathbb{R}^{1 \times WH}, \quad j = 1, 2, \ldots, n \quad (3)$$
Here, $n$ represents the total number of feature maps output by the $l$-th convolutional layer. We have thus obtained the feature vectors of the $i$-th image in the sequence from the $l$-th layer of ResNet152. Keeping $l$ unchanged and stacking the feature vectors of all $n$ feature maps, we obtain the feature matrix $V_l^i$:
$$V_l^i = \left[ v_l^1; v_l^2; \ldots; v_l^n \right] \in \mathbb{R}^{n \times WH} \quad (4)$$
$V_l^i$ can be considered an overall descriptor of the image features. However, owing to the interference of redundant information in dynamic environments, the accuracy of $V_l^i$ in similarity detection may decline. The redundant information in the feature maps can be attributed to dynamic objects, environmental changes, and factors such as occlusion, even though the loop output after similarity detection often represents the same scene. At different times of day, variations in background, moving objects, and lighting conditions within the same scene can change the features extracted from the images.
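The sketch below illustrates this step: capturing the feature maps of one intermediate ResNet152 stage with a forward hook and flattening each map into one row of the descriptor matrix. The choice of layer3, the pretrained torchvision weights, and the random tensor standing in for a preprocessed image are assumptions for illustration:

```python
import torch
from torchvision import models

# Pretrained ResNet152 backbone; we only read intermediate activations.
model = models.resnet152(weights="IMAGENET1K_V1").eval()

captured = {}
model.layer3.register_forward_hook(lambda m, inp, out: captured.update(out=out.detach()))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # stand-in for a preprocessed 224x224 frame

fmap = captured["out"][0]        # (n, W, H): the n feature maps of the chosen layer
V = fmap.flatten(start_dim=1)    # (n, W*H): one flattened feature map per row
print(V.shape)                   # for layer3 of ResNet152: torch.Size([1024, 196])
```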
Based on these considerations, the authors of this study chose to investigate the New College and City Center datasets. The images in the New College dataset contain numerous dynamic objects, primarily people, most of whom appear in the same scene. The City Center dataset primarily captures images of communities and roads, featuring a variety of objects such as cars, pedestrians, and obstacles. Detailed information about the datasets is presented in the experimental section.
To analyze the differences in features extracted from the same scene in dynamic environments, this study utilizes Grad-CAM (Gradient-weighted Class Activation Mapping) [34] to visualize the feature maps output by ResNet152. This technique converts the network's output feature maps into heatmap representations, allowing observation of the regions on which the network focuses.
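As a sketch of how such heatmaps can be produced, here is a minimal Grad-CAM using forward/backward hooks on a pretrained torchvision ResNet152. Taking layer4 as the target stage and the top-scoring class as the target are assumptions; this is not the authors' exact visualization code:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet152(weights="IMAGENET1K_V1").eval()

feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(v=o.detach()))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0].detach()))

img = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed frame
score = model(img).max()            # target: the highest-scoring class
score.backward()

w = grads["v"].mean(dim=(2, 3), keepdim=True)             # channel weights: GAP of gradients
cam = F.relu((w * feats["v"]).sum(dim=1, keepdim=True))   # weighted sum of maps + ReLU
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to a [0, 1] heatmap
```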
Figure 6 and Figure 7 demonstrate the impact of people, as moving objects in the same scene, on the network's feature maps in the New College dataset.
In Figure 6, the red areas indicate the parts of the image that are of interest to the network, while the blue areas represent the parts in which the network is not interested. From Figure 6a,b, it can be observed that the pedestrian area in the image attracts interest, and the network treats the trees as part of the image background. In Figure 6c,d, the pedestrian blocks are not significantly activated, and the network treats the grass as the image background. In Figure 6e,f, only a small portion of the pedestrian blocks is activated, while most remain unactivated.
From Figure 7a–d, it can be seen that the influence of pedestrians as moving objects is more pronounced: the pre-trained network focuses more on the architectural features in the scene, such as windows and walls, and pays relatively little attention to pedestrians. Figure 8 and Figure 9 illustrate the impact of environmental changes and occlusions on the network's feature maps in the same scene in the City Center dataset.
From Figure 8a–d, it is evident that the network pays more attention to scene features under rich lighting conditions, while shadowed areas are treated as background. The shadowed area in Figure 8a is significantly larger than that in Figure 8c, resulting in different focus positions in the two images. In similarity detection, this difference might lead to the images being misclassified as non-similar. In Figure 9a–d, the effect of bicycles as occluding objects is more pronounced: owing to the presence of occluding objects, the direction of the network's focus on features shifts.
In summary, the pre-trained network is capable of recognizing some background objects in the scene, such as trees, walls, windows, and railings. In the context of visual SLAM, these background objects can serve as landmarks, which is advantageous for loop detection. However, the network's disadvantages arise from changes in lighting conditions, the movement of pedestrians, and the appearance of occluding objects. These factors cause variations in the network's focus on specific features, including pedestrian features and redundant features detected due to background changes. Both aspects, as redundant information in the image feature regions, significantly interfere with subsequent similarity detection, thereby hindering loop detection.
2.2. Feature Compression
In dynamic environments, selecting appropriate output feature maps from the convolutional layers of a CNN is crucial. The closer the chosen convolutional layer is to the input, the more internal redundant information it contains. For ResNet152, when dealing with a large number of test samples, extracting high-level feature representations from low-resolution input images involves many abstract uncertainties. In loop detection, the goal is for the system to accurately identify previously visited scenes in dynamic environments, a task susceptible to such uncertainties. S. Bannour et al. [35] used a recursive least-squares-type algorithm for the sequential training of feature extraction in images, extracting the principal components of the input. They demonstrated that a dimensionality-reduction method can compress the feature information in color images, addressing the aforementioned issues. Following this approach, our study employs the KL transformation to preprocess the output feature maps of the CNN, aiming to compress the image while retaining its crucial information and minimizing the impact of non-essential information on subsequent image similarity detection.
The KL transformation [36] calculates the correlation between the various components of the input and compresses the input information based on the eigenvalues and eigenvectors of the covariance matrix. The KL transformation can therefore preserve the main features of the original image while extracting and combining key information, forming new feature information in which the elements are mutually uncorrelated. This process reduces the image dimensionality while preserving essential features and, simultaneously, removes noise from the image information. The flowchart of the KL transformation as applied to images is illustrated in Figure 10.
Through the feature extraction conducted by the backbone network, as discussed in Section 2.1, we obtained vector representations $V_l^i$ for the output feature maps of each convolutional layer. The size was set as $n \times (W \times H)$, where $n$ represents the number of feature maps output by convolutional layer $l$ and $W \times H$ denotes the dimensions of a single feature map. However, as outlined in the preceding sections, we utilized a Siamese neural network to detect two sequences of images simultaneously rather than individual images. Therefore, considering the feature representations of all images within a sequence, we constructed their corresponding feature vectors:
$$V_l^i \in \mathbb{R}^{n \times WH}, \quad i = 1, 2, \ldots, s \quad (5)$$
Here, $s$ denotes the number of images in the sequence. Considering all the images within the sequence, the feature matrix formed by the feature vectors output by convolutional layer $l$ is given by:
$$F_l = \left[ V_l^1, V_l^2, \ldots, V_l^s \right] \quad (6)$$
From Equation (6), it can be seen that $F_l$ is a large-scale feature matrix. Its rows represent all the feature maps of a single image output by the current convolutional layer, and its columns represent the feature maps of all images at the corresponding position in the output sequence of the current convolutional layer. It can be observed that $F_l$ contains rich feature information, including interference from dynamic objects and occlusions. Directly processing it would significantly increase computation time and consume memory resources.
Next, we applied the KL transform to reduce the dimensionality of $F_l$ and to remove noise. First, we partitioned $F_l$ blockwise, column by column, to obtain
$$X = \left[ x_1, x_2, \ldots, x_N \right], \quad x_k \in \mathbb{R}^{m} \quad (7)$$
After partitioning $X$, the mean was calculated to obtain
$$E(x) = \frac{1}{N}\sum_{k=1}^{N} x_k \quad (8)$$
Combining Equations (7) and (8), the difference between $x_k$ and $E(x)$ can be obtained as
$$\tilde{x}_k = x_k - E(x) \quad (9)$$
The covariance matrix of $X$ is defined as
$$C = \frac{1}{N}\sum_{k=1}^{N} \tilde{x}_k \tilde{x}_k^{T} \quad (10)$$
where $C$ is an $m \times m$ positive definite symmetric matrix. There exist $m$ mutually orthogonal eigenvectors $\varphi_j$ and corresponding eigenvalues $\lambda_j$ such that
$$C \varphi_j = \lambda_j \varphi_j, \quad j = 1, 2, \ldots, m \quad (11)$$
To reduce the dimensionality of $X$, let $d$ be the number of selected eigenvectors. The key to using the KL transformation is as follows:
$$\Phi_d = \left[ \varphi_1, \varphi_2, \ldots, \varphi_d \right] \quad (12)$$
$$y_k = \Phi_d^{T} \tilde{x}_k, \quad k = 1, 2, \ldots, N \quad (13)$$
$$Y = \left[ y_1, y_2, \ldots, y_N \right] \quad (14)$$
where $d$ must be smaller than $m$; otherwise, compression cannot be performed. $Y$ is the new matrix obtained by applying the transformation matrix $\Phi_d$ to $X$. This matrix contains fewer elements than $F_l$ and is of size $d \times N$, and the eigenvectors corresponding to its different eigenvalues are linearly independent with minimum variance. In the KL transform, each eigenvector of the matrix is called a principal component; the eigenvector corresponding to the largest eigenvalue is the first principal component, and the $j$-th eigenvalue $\lambda_j$ measures the coherent energy magnitude of the $j$-th principal component. The energy of the input features can be represented by the eigenvalues of the covariance matrix:
$$E_d = \frac{\sum_{j=1}^{d} \lambda_j}{\sum_{j=1}^{m} \lambda_j} \quad (15)$$
Equation (15) expresses the total energy captured by the selected principal components. The KL transformation thus captures the most significant features while discarding less relevant information.
In Equation (15), $m$ is the original length of the eigenvector set and $d$ is the truncated length. The equation indicates that, to find the linear transformation matrix $\Phi_d$ representing $d$ eigenvectors, the $d$ largest eigenvalues should be selected in descending order. As $\lambda_j$ represents the magnitude of variance, we aim for larger variances in the newly transformed features. Information content reflects the complexity and diversity of the data and is thus directly related to its uncertainty. Variance measures the deviation between a random variable and its mean: smaller deviations indicate smaller variances, higher correlation between pieces of information, and greater repetitiveness. By contrast, larger variances suggest lower correlation between pieces of information, the presence of more information, and greater distinguishability between different samples. This characteristic is necessary for loop detection. In our research, using Equation (15) directly is not conducive to interpretation and cannot serve as a precise standard. Therefore, based on Equation (15), we introduce the compression ratio $\eta$:
$$\eta = \frac{d}{m} \times 100\% \quad (16)$$
where the compression ratio $\eta$ is the ratio, expressed as a percentage, of the truncated eigenvector length to the original eigenvector length. Setting different compression ratios allows the dimensionality-reduction capability of the KL transformation to be measured in different scenarios. To validate the feasibility of the above method, we take images from the New College dataset as an example, as shown in Figure 11.
The feature maps extracted from the conv3 layer of ResNet152 can be represented in matrix form. Applying the KL transformation to map this matrix to a low-dimensional subspace and flattening the result into a one-dimensional vector, it can be observed that the magnitude of the elements gradually decreases as the index increases. The result of visualizing this phenomenon is shown in Figure 12. It can be seen that the principal components we need to retain are concentrated primarily at the front, while the later part contains the redundant information to be removed. Therefore, we can select the first $k$ principal components for feature compression.
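A minimal NumPy sketch of this KL-transform compression, assuming the feature matrix stores one sample per column and using a hypothetical compression-ratio parameter corresponding to Equation (16):

```python
import numpy as np

def kl_transform(F: np.ndarray, ratio: float = 0.25):
    """KL (Karhunen-Loeve) transform sketch: project the m x N feature matrix F
    onto the top-d eigenvectors of its covariance matrix (d = ratio * m)."""
    mean = F.mean(axis=1, keepdims=True)
    X = F - mean                                  # center the columns, cf. Eq. (9)
    C = X @ X.T / X.shape[1]                      # m x m covariance matrix, cf. Eq. (10)
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues in descending order
    d = max(1, int(ratio * len(eigvals)))         # keep the top-d components (d < m)
    Phi = eigvecs[:, order[:d]]                   # transformation matrix, cf. Eq. (12)
    Y = Phi.T @ X                                 # compressed d x N features, cf. Eq. (13)
    energy = eigvals[order[:d]].sum() / eigvals.sum()  # retained energy, cf. Eq. (15)
    return Y, Phi, energy

# Usage sketch with an illustrative 2048 x 500 feature matrix:
Y, Phi, energy = kl_transform(np.random.default_rng(0).standard_normal((2048, 500)))
```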
2.3. Image Similarity Calculation
In the similarity-calculation phase, the similarity of two images must be compared through a similarity-measurement method. In Siamese neural networks, the image features extracted by the convolutional sub-network are unordered, disregarding the spatial-position information of the features. NetVLAD [37] is a feature-encoding method that enhances the interrelation of features to improve the recognition ability of neural networks; it can capture the relative spatial distribution of features. When combined with features compressed by the KL transformation, it generates descriptors with stronger representational and anti-interference capabilities.
The training objective of NetVLAD is to optimize the positions of the cluster centers and the weight with which each descriptor is assigned to each cluster center after KL-transformation compression. It calculates the weighted residuals between the feature-map descriptors and the cluster centers and uses them as the image descriptor vector. This vector is designed so that, during image similarity detection, descriptors of the same scene have the greatest similarity, i.e., minimal cosine distance. The NetVLAD structure based on ResNet152 is illustrated in Figure 13.
The output of the selected convolutional layer of ResNet152 is taken and, after the features are compressed through the KL transformation, the feature dimensions become W × H × D. The W × H local descriptors of dimension D are then provided as input to the NetVLAD layer, with the K × D cluster-center parameters obtained from the clustering algorithm. Computing the elements of the image description matrix $V$ involves calculating the weighted residuals of the W × H local features with respect to the cluster centers, and the formula for $V(j, k)$ is as follows:
$$V(j, k) = \sum_{i=1}^{N} a_k(x_i)\left(x_i(j) - c_k(j)\right)$$
where $N = W \times H$ is the number of local descriptors, $x_i(j)$ represents the $j$-th feature value of the $i$-th local feature, and $c_k(j)$ denotes the $j$-th feature value of the $k$-th cluster center. The computation determines the distance-based weight $a_k(x_i)$ of each local feature with respect to each cluster center, constraining the weights of the local feature descriptors under each cluster to the range 0–1; a higher weight indicates that the feature is closer to that cluster center. Finally, $V$ is transformed into vector form, normalized, and output as the image's feature description vector, resulting in a VLAD vector of length K × D. This vector serves as the original image's feature descriptor.
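The following is a minimal PyTorch sketch of such a NetVLAD layer. Learnable centroids and a linear soft-assignment head are common implementation choices assumed here for illustration; this is not the exact layer used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assigns N local descriptors of dim D to K cluster centers and
    aggregates the weighted residuals into a K*D-dimensional vector."""
    def __init__(self, num_clusters: int, dim: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

    def forward(self, x):                      # x: (B, N, D) local descriptors
        a = F.softmax(self.assign(x), dim=-1)  # (B, N, K) weights a_k(x_i) in [0, 1]
        # residuals x_i - c_k between each descriptor and each centroid: (B, N, K, D)
        res = x.unsqueeze(2) - self.centroids.unsqueeze(0).unsqueeze(0)
        vlad = (a.unsqueeze(-1) * res).sum(dim=1)          # (B, K, D) weighted residuals
        vlad = F.normalize(vlad, p=2, dim=-1)              # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), p=2, dim=-1)   # final (B, K*D) VLAD vector

# Usage sketch: 196 local descriptors of dim 512, K = 64 clusters -> (2, 64 * 512).
vlad = NetVLAD(num_clusters=64, dim=512)(torch.randn(2, 196, 512))
```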
The network architecture designed in this paper takes images as input in the form of a sequence, and the network outputs a set of feature vectors for the sequence, where each element is the VLAD feature descriptor of the corresponding image. Therefore, in similarity detection, we directly compute the similarity between two sequences. If the distance between two sequences is small, loops are present in the sequence, and the higher the similarity between the sequences, the more loops may exist. Conversely, a lower similarity suggests that there are few or no loops in the sequence.
Let the input sequence at the current time be $S_c$ and the selected historical sequence at a previous time be $S_h$, both sequences having a length of $s$. The corresponding VLAD feature vectors of the two sequences are denoted as $V_c$ and $V_h$:
$$V_c = \left[ v_c^1, v_c^2, \ldots, v_c^s \right], \quad V_h = \left[ v_h^1, v_h^2, \ldots, v_h^s \right]$$
The standard Euclidean distance is used to calculate the distance between the two vectors, as follows:
$$D\left(V_c, V_h\right) = \sqrt{\sum_{i=1}^{s} \left\| v_c^i - v_h^i \right\|_2^2}$$
During the retrieval process, the current image sequence is used as the input to the first backbone network and the historical image sequence as the input to the second backbone network, and the distance between the sequences is calculated. A threshold $\alpha$ is set so that sequence pairs with distances below $\alpha$ are detected as loops. Adjusting the threshold $\alpha$ allows a balance between precision and recall.
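A short NumPy sketch of this retrieval step, assuming the two sequences' VLAD descriptors are stacked row-wise and using an illustrative threshold value:

```python
import numpy as np

def detect_loop(V_c: np.ndarray, V_h: np.ndarray, alpha: float):
    """V_c, V_h: (s, K*D) VLAD descriptors of the current and historical sequences.
    Returns the sequence distance and whether it falls below the loop threshold alpha."""
    d = np.sqrt(((V_c - V_h) ** 2).sum())   # Euclidean distance over the whole sequence
    return d, d < alpha

# Example: s = 5 images, K*D = 1024-dimensional descriptors (illustrative sizes).
rng = np.random.default_rng(0)
V_c = rng.standard_normal((5, 1024))
V_h = V_c + 0.01 * rng.standard_normal((5, 1024))   # near-duplicate historical sequence
dist, is_loop = detect_loop(V_c, V_h, alpha=10.0)
print(dist, is_loop)
```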