This section provides a detailed description of the overall process of the loop closure-detection method proposed in this paper. The entire process is described in the following subsections.
2.1. Feature Extraction by the Backbone Network
The Siamese network [31] is a similarity-metric-based learning method that has achieved significant success in fields such as scene recognition and audio processing. The network has a simple structure, making it easy to train, and consists of two identical backbone networks as sub-networks. It takes two image sequences as input and outputs pairs of similar images based on a similarity-detection method applied to the features extracted by the backbone networks. The network architecture is illustrated in Figure 2.
Here, $X_1$ and $X_2$ are the inputs of the network, $G_W$ represents the two identical convolutional neural networks (i.e., the sub-networks), $W$ denotes the network weights shared across the two sub-networks, and $E_W$ is the similarity score of the outputs of the two networks. This network is a feedforward network. In image similarity detection, each input consists of two different images and a label. The label is encoded in binary form: a label of 0 indicates that the two images are dissimilar, and a label of 1 indicates that they are similar. The input images undergo feature extraction by the convolutional neural networks, and the extracted features are flattened into feature vectors. The distance between these two feature vectors is then calculated and fed into the loss function, through which the shared network weights are optimized. Throughout this process, the features of the two images are mapped to a new subspace in which positive samples lie close to each other and negative samples lie far apart, completing the optimal matching.
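To make this concrete, the following is a minimal PyTorch-style sketch of a weight-shared Siamese pair with a contrastive loss under the labeling convention above (1 = similar, 0 = dissimilar). The class and function names, and the toy backbone, are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    """Both inputs pass through the SAME backbone, so the weights W are shared."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x1, x2):
        # G_W(x1), G_W(x2): one module, two forward passes => shared weights
        return self.backbone(x1), self.backbone(x2)

def contrastive_loss(f1, f2, label, margin=1.0):
    """label = 1 for similar pairs, 0 for dissimilar, as described in the text."""
    d = F.pairwise_distance(f1, f2)                        # Euclidean distance of embeddings
    similar = label * d.pow(2)                             # pull similar pairs together
    dissimilar = (1 - label) * F.relu(margin - d).pow(2)   # push dissimilar pairs apart
    return (similar + dissimilar).mean()

# Usage sketch with a stand-in backbone:
net = SiameseNet(nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 128)))
f1, f2 = net(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224))
loss = contrastive_loss(f1, f2, torch.tensor([1.0, 0.0, 1.0, 0.0]))
```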
For image-classification tasks, the number of neurons in the output layer of a neural network is typically equal to the number of categories. However, the research presented in this paper requires only similarity detection, not classification. The output of the fully connected layer can therefore be fed directly into the loss function for similarity calculation. Generally, the Euclidean distance or cosine distance is used, and there is no restriction on the dimensionality of the fully connected layer's output. Consequently, there are many choices for the backbone network of the Siamese neural network in this project, including classical models such as VGG16, AlexNet, GoogLeNet, ResNet, and Inception-v4. Canziani et al. [32] conducted Top-1 accuracy tests on some commonly used CNN models in practical applications, and the test results are shown in Figure 3.
From the graph, it can be observed that, among the various CNN architectures, ResNet-34, ResNet-101, and ResNet-152 outperform some earlier networks. The earlier CNN models improved performance simply by increasing the width and depth of the network; the VGG in Figure 3, for example, was designed by applying this approach to AlexNet. However, merely increasing the number of layers does not necessarily yield better performance. Once the number of layers reaches a certain point, the network's performance tends to saturate, and adding further layers can lead to issues such as gradient explosion and degradation, thereby reducing the model's performance. This phenomenon is not caused by overfitting; rather, it is due to the increase in computational cost and parameter count as the network grows deeper, which makes the model challenging to train. By contrast, deeper networks can capture more information from images, resulting in relatively richer features. Especially in complex dynamic environments, the network should differentiate between different scenes as much as possible while still recognizing the same scene when it contains moving objects and occlusions.
ResNet [33] is a convolutional neural network with a residual structure, proposed by Kaiming He and his colleagues at Microsoft Research and based on VGG19. The network was originally designed to address the gradient explosion and degradation issues caused by the continual increase in the number of intermediate layers (convolutional, pooling, and subsampling layers) in traditional convolutional neural networks. The residual structure is inspired by the shortcut connections in highway networks, introducing a "skip connection" to form a learning structure, as illustrated in Figure 4. The core idea is to stack new layers onto a shallow CNN to create a deep CNN. For a newly added layer, assuming the input is $x$ and the image features extracted from $x$ in the current layer are denoted as $F(x)$, the output of this layer is, theoretically, as follows:
$$H(x) = F(x) + x \quad (1)$$
Here, $H(x)$ represents the output features, and the residual corresponds to the difference between the output and the input, i.e., $F(x) = H(x) - x$. Now consider the extreme case in which the new layer has not learned any features; the residual should be zero in this scenario. Consequently, the layer would pass the original features through the skip connection unchanged, making the new layer essentially an "identity mapping" relative to the old layer. In practice, however, the residual is generally small but not exactly zero. This property makes training the residual easier than training by traditional methods.
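As an illustration of Equation (1), here is a minimal PyTorch sketch of a basic two-convolution residual block; ResNet152 actually uses the three-convolution bottleneck variant described below, but the skip-connection principle is identical:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x: the stacked layers only learn the residual F(x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(residual + x)  # skip connection adds the identity term x
```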
For ResNets with different numbers of layers, the generally adopted residual modules are the two types illustrated in Figure 4. The left-side residual module is utilized for shallow ResNet architectures such as ResNet18 and ResNet34. In contrast, the right-side (bottleneck) residual module is employed for deeper ResNet architectures such as ResNet50, ResNet101, and ResNet152, taking the dimensionality-reduction requirement into consideration. Given the remarkable performance of ResNet, this paper employs ResNet152 as the backbone sub-network for feature extraction, constructing a Siamese network architecture. The network structure can be broadly categorized into the image input layer, pooling layer, convolutional layers, and feature output layer, as depicted in Figure 5.
The dashed-line portion represents the intermediate convolutional layers, where the parameters denote the number of channels, width, and height of each output feature map. The input section consists of a convolutional layer with a kernel size of 7 × 7 and a stride of 2, followed by a pooling layer with a kernel size of 3 × 3 and a stride of 2, used primarily for dimensionality reduction of the input image. The middle section comprises layers 1 to 4, where convolutional layers of the same color form a layer. Layer 1 consists of 3 residual modules, layer 2 of 8, layer 3 of 36, and layer 4 of 3. Each residual module is composed of two convolutional layers with a kernel size of 1 × 1 and a stride of 1 and one convolutional layer with a kernel size of 3 × 3 and a stride of 1. The residual modules are interconnected through skip connections. Ultimately, given an input image of size 224 × 224 with 3 channels, layer 1 produces feature maps of size 56 × 56; the output size is halved in each subsequent layer, resulting in a final feature map of size 7 × 7 × 2048. Because max-pooling layers significantly reduce the precision of image features, the top-level output is unsuitable for similarity detection in dynamic scenes with rich features; moreover, directly using feature maps of such large dimensionality is computationally inefficient. The top-level output also contains many abstract, invariant details, leading to interference from redundant information during similarity detection in dynamic environments.
Therefore, in our work, we focus on the feature maps output by each layer of the ResNet rather than directly processing the top-layer output. Considering the output for a single image, let $l$ denote the selected convolutional layer. The $j$-th feature map output by this layer can be denoted as $M_l^j$:
$$M_l^j \in \mathbb{R}^{W \times H}, \quad 1 \le l \le L \quad (2)$$
Here, $W$ and $H$ represent the width and height of the feature map, respectively, and $L$ is the total number of layers in the network. Each feature map $M_l^j$ is flattened into a one-dimensional vector to construct the image feature descriptor:
$$v_l^j = \mathrm{flatten}\left(M_l^j\right) \in \mathbb{R}^{1 \times WH}, \quad j = 1, 2, \ldots, n \quad (3)$$
Here, $n$ represents the total number of feature maps output by the $l$-th convolutional layer. We have thus obtained the feature vectors of the $i$-th image in the sequence from the $l$-th layer of ResNet152. Keeping $l$ unchanged and stacking the feature vectors of all $n$ feature maps, we obtain the feature matrix $V_l^i$:
$$V_l^i = \left[ v_l^1; v_l^2; \ldots; v_l^n \right] \in \mathbb{R}^{n \times WH} \quad (4)$$
$V_l^i$ can be considered an overall descriptor of the image features. However, owing to the interference of redundant information in dynamic environments, the accuracy of $V_l^i$ in similarity detection may decline. The redundant information in the feature maps can be attributed to dynamic objects, environmental changes, and factors such as occlusion, even though the loop output after similarity detection often represents the same scene. At different times of day, variations in background, moving objects, and lighting conditions within the same scene can change the features extracted from the images.
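The sketch below illustrates this step: capturing the feature maps of one intermediate ResNet152 stage with a forward hook and flattening each map into one row of the descriptor matrix. The choice of layer3, the pretrained torchvision weights, and the random tensor standing in for a preprocessed image are assumptions for illustration:

```python
import torch
from torchvision import models

# Pretrained ResNet152 backbone; we only read intermediate activations.
model = models.resnet152(weights="IMAGENET1K_V1").eval()

captured = {}
model.layer3.register_forward_hook(lambda m, inp, out: captured.update(out=out.detach()))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # stand-in for a preprocessed 224x224 frame

fmap = captured["out"][0]        # (n, W, H): the n feature maps of the chosen layer
V = fmap.flatten(start_dim=1)    # (n, W*H): one flattened feature map per row
print(V.shape)                   # for layer3 of ResNet152: torch.Size([1024, 196])
```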
Based on these considerations, the authors of this study chose to investigate the New College and City Center datasets. The images in the New College dataset contain numerous dynamic objects, primarily people, most of whom appear in the same scene. The City Center dataset primarily captures images of communities and roads, featuring a variety of objects such as cars, pedestrians, and obstacles. Detailed information about the datasets is presented in the experimental section.
To analyze the differences in features extracted from the same scene in dynamic environments, this study utilizes Grad-CAM (Gradient-weighted Class Activation Mapping) [34] to visualize the feature maps output by ResNet152. This technique converts the network's output feature maps into heatmap representations, allowing observation of the regions on which the network focuses.
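As a sketch of how such heatmaps can be produced, here is a minimal Grad-CAM using forward/backward hooks on a pretrained torchvision ResNet152. Taking layer4 as the target stage and the top-scoring class as the target are assumptions; this is not the authors' exact visualization code:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet152(weights="IMAGENET1K_V1").eval()

feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(v=o.detach()))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0].detach()))

img = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed frame
score = model(img).max()            # target: the highest-scoring class
score.backward()

w = grads["v"].mean(dim=(2, 3), keepdim=True)             # channel weights: GAP of gradients
cam = F.relu((w * feats["v"]).sum(dim=1, keepdim=True))   # weighted sum of maps + ReLU
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to a [0, 1] heatmap
```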
Figure 6 and Figure 7 demonstrate the impact of people, as moving objects in the same scene, on the network's feature maps in the New College dataset.
In Figure 6, the red areas indicate the parts of the image that are of interest to the network, while the blue areas represent the parts in which the network is not interested. From Figure 6a,b, it can be observed that the pedestrian area in the image attracts interest, and the network treats the trees as part of the image background. In Figure 6c,d, the pedestrian blocks are not significantly activated, and the network treats the grass as the image background. In Figure 6e,f, only a small portion of the pedestrian blocks is activated, while most remain unactivated.
From Figure 7a–d, it can be seen that the influence of pedestrians as moving objects is more pronounced: the pre-trained network focuses more on the architectural features in the scene, such as windows and walls, and pays relatively little attention to pedestrians. Figure 8 and Figure 9 illustrate the impact of environmental changes and occlusions on the network's feature maps in the same scene in the City Center dataset.
From Figure 8a–d, it is evident that the network pays more attention to scene features under rich lighting conditions, while shadowed areas are treated as background. The shadowed area in Figure 8a is significantly larger than that in Figure 8c, resulting in different focus positions in the two images. In similarity detection, this difference might lead to the images being misclassified as non-similar. In Figure 9a–d, the effect of bicycles as occluding objects is more pronounced: owing to the presence of occluding objects, the direction of the network's focus on features shifts.
In summary, the pre-trained network is capable of recognizing some background objects in the scene, such as trees, walls, windows, and railings. In the context of visual SLAM, these background objects can serve as landmarks, which is advantageous for loop detection. However, the network's disadvantages arise from changes in lighting conditions, the movement of pedestrians, and the appearance of occluding objects. These factors cause variations in the network's focus on specific features, including pedestrian features and redundant features detected due to background changes. Both aspects, as redundant information in the image feature regions, significantly interfere with subsequent similarity detection, thereby hindering loop detection.
2.2. Feature Compression
In dynamic environments, selecting appropriate output feature maps from the convolutional layers of a CNN is crucial. The closer the chosen convolutional layer is to the input, the more internal redundant information it contains. For ResNet152, when dealing with a large number of test samples, extracting high-level feature representations from low-resolution input images involves many abstract uncertainties. In loop detection, the goal is for the system to accurately identify previously visited scenes in dynamic environments, a task susceptible to such uncertainties. S. Bannour et al. [35] used a recursive least-squares-type algorithm for the sequential training of feature extraction in images, extracting the principal components of the input. They demonstrated that a dimensionality-reduction method can compress the feature information in color images, addressing the aforementioned issues. Following this approach, our study employs the KL transformation to preprocess the output feature maps of the CNN, aiming to compress the image while retaining its crucial information and minimizing the impact of non-essential information on subsequent image similarity detection.
The KL transformation [36] calculates the correlation between the various components of the input and compresses the input information based on the eigenvalues and eigenvectors of the covariance matrix. The KL transformation can therefore preserve the main features of the original image while extracting and combining key information, forming new feature information in which the elements are mutually uncorrelated. This process reduces the image dimensionality while preserving essential features and, simultaneously, removes noise from the image information. The flowchart of the KL transformation as applied to images is illustrated in Figure 10.
Through the feature extraction conducted by the backbone network, as discussed in Section 2.1, we obtained vector representations $V_l^i$ for the output feature maps of each convolutional layer. The size was set as $n \times (W \times H)$, where $n$ represents the number of feature maps output by convolutional layer $l$ and $W \times H$ denotes the dimensions of a single feature map. However, as outlined in the preceding sections, we utilized a Siamese neural network to detect two sequences of images simultaneously rather than individual images. Therefore, considering the feature representations of all images within a sequence, we constructed their corresponding feature vectors:
$$V_l^i \in \mathbb{R}^{n \times WH}, \quad i = 1, 2, \ldots, s \quad (5)$$
Here, $s$ denotes the number of images in the sequence. Considering all the images within the sequence, the feature matrix formed by the feature vectors output by convolutional layer $l$ is given by:
$$F_l = \left[ V_l^1, V_l^2, \ldots, V_l^s \right] \quad (6)$$
From Equation (6), it can be seen that $F_l$ is a large-scale feature matrix. Its rows represent all the feature maps of a single image output by the current convolutional layer, and its columns represent the feature maps of all images at the corresponding position in the output sequence of the current convolutional layer. It can be observed that $F_l$ contains rich feature information, including interference from dynamic objects and occlusions. Directly processing it would significantly increase computation time and consume memory resources.
Next, we applied the KL transform to reduce the dimensionality of $F_l$ and to remove noise. First, we partitioned $F_l$ blockwise, column by column, to obtain
$$X = \left[ x_1, x_2, \ldots, x_N \right], \quad x_k \in \mathbb{R}^{m} \quad (7)$$
After partitioning $X$, the mean was calculated to obtain
$$E(x) = \frac{1}{N}\sum_{k=1}^{N} x_k \quad (8)$$
Combining Equations (7) and (8), the difference between $x_k$ and $E(x)$ can be obtained as
$$\tilde{x}_k = x_k - E(x) \quad (9)$$
The covariance matrix of $X$ is defined as
$$C = \frac{1}{N}\sum_{k=1}^{N} \tilde{x}_k \tilde{x}_k^{T} \quad (10)$$
where $C$ is an $m \times m$ positive definite symmetric matrix. There exist $m$ mutually orthogonal eigenvectors $\varphi_j$ and corresponding eigenvalues $\lambda_j$ such that
$$C \varphi_j = \lambda_j \varphi_j, \quad j = 1, 2, \ldots, m \quad (11)$$
To reduce the dimensionality of $X$, let $d$ be the number of selected eigenvectors. The key to using the KL transformation is as follows:
$$\Phi_d = \left[ \varphi_1, \varphi_2, \ldots, \varphi_d \right] \quad (12)$$
$$y_k = \Phi_d^{T} \tilde{x}_k, \quad k = 1, 2, \ldots, N \quad (13)$$
$$Y = \left[ y_1, y_2, \ldots, y_N \right] \quad (14)$$
where $d$ must be smaller than $m$; otherwise, compression cannot be performed. $Y$ is the new matrix obtained by applying the transformation matrix $\Phi_d$ to $X$. This matrix contains fewer elements than $F_l$ and is of size $d \times N$, and the eigenvectors corresponding to its different eigenvalues are linearly independent with minimum variance. In the KL transform, each eigenvector of the matrix is called a principal component; the eigenvector corresponding to the largest eigenvalue is the first principal component, and the $j$-th eigenvalue $\lambda_j$ measures the coherent energy magnitude of the $j$-th principal component. The energy of the input features can be represented by the eigenvalues of the covariance matrix:
$$E_d = \frac{\sum_{j=1}^{d} \lambda_j}{\sum_{j=1}^{m} \lambda_j} \quad (15)$$
Equation (15) expresses the total energy captured by the selected principal components. The KL transformation thus captures the most significant features while discarding less relevant information.
In Equation (15), $m$ is the original length of the eigenvector set and $d$ is the truncated length. The equation indicates that, to find the linear transformation matrix $\Phi_d$ representing $d$ eigenvectors, the $d$ largest eigenvalues should be selected in descending order. As $\lambda_j$ represents the magnitude of variance, we aim for larger variances in the newly transformed features. Information content reflects the complexity and diversity of the data and is thus directly related to its uncertainty. Variance measures the deviation between a random variable and its mean: smaller deviations indicate smaller variances, higher correlation between pieces of information, and greater repetitiveness. By contrast, larger variances suggest lower correlation between pieces of information, the presence of more information, and greater distinguishability between different samples. This characteristic is necessary for loop detection. In our research, using Equation (15) directly is not conducive to interpretation and cannot serve as a precise standard. Therefore, based on Equation (15), we introduce the compression ratio $\eta$:
$$\eta = \frac{d}{m} \times 100\% \quad (16)$$
where the compression ratio $\eta$ is the ratio, expressed as a percentage, of the truncated eigenvector length to the original eigenvector length. Setting different compression ratios allows the dimensionality-reduction capability of the KL transformation to be measured in different scenarios. To validate the feasibility of the above method, we take images from the New College dataset as an example, as shown in Figure 11.
The feature maps extracted from the conv3 layer of ResNet152 can be represented in matrix form. Applying the KL transformation to map this matrix to a low-dimensional subspace and flattening the result into a one-dimensional vector, it can be observed that the magnitude of the elements gradually decreases as the index increases. The result of visualizing this phenomenon is shown in Figure 12. It can be seen that the principal components we need to retain are concentrated primarily at the front, while the later part contains the redundant information to be removed. Therefore, we can select the first $k$ principal components for feature compression.
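A minimal NumPy sketch of this KL-transform compression, assuming the feature matrix stores one sample per column and using a hypothetical compression-ratio parameter corresponding to Equation (16):

```python
import numpy as np

def kl_transform(F: np.ndarray, ratio: float = 0.25):
    """KL (Karhunen-Loeve) transform sketch: project the m x N feature matrix F
    onto the top-d eigenvectors of its covariance matrix (d = ratio * m)."""
    mean = F.mean(axis=1, keepdims=True)
    X = F - mean                                  # center the columns, cf. Eq. (9)
    C = X @ X.T / X.shape[1]                      # m x m covariance matrix, cf. Eq. (10)
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues in descending order
    d = max(1, int(ratio * len(eigvals)))         # keep the top-d components (d < m)
    Phi = eigvecs[:, order[:d]]                   # transformation matrix, cf. Eq. (12)
    Y = Phi.T @ X                                 # compressed d x N features, cf. Eq. (13)
    energy = eigvals[order[:d]].sum() / eigvals.sum()  # retained energy, cf. Eq. (15)
    return Y, Phi, energy

# Usage sketch with an illustrative 2048 x 500 feature matrix:
Y, Phi, energy = kl_transform(np.random.default_rng(0).standard_normal((2048, 500)))
```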
2.3. Image Similarity Calculation
In the similarity-calculation phase, the similarity of two images must be compared through a similarity-measurement method. In Siamese neural networks, the image features extracted by the convolutional sub-network are unordered, disregarding the spatial-position information of the features. NetVLAD [37] is a feature-encoding method that enhances the interrelation of features to improve the recognition ability of neural networks; it can capture the relative spatial distribution of features. When combined with features compressed by the KL transformation, it generates descriptors with stronger representational and anti-interference capabilities.
The training objective of NetVLAD is to optimize the positions of the cluster centers and the weight with which each descriptor is assigned to each cluster center after KL-transformation compression. It calculates the weighted residuals between the feature-map descriptors and the cluster centers and uses them as the image descriptor vector. This vector is designed so that, during image similarity detection, descriptors of the same scene have the greatest similarity, i.e., minimal cosine distance. The NetVLAD structure based on ResNet152 is illustrated in Figure 13.
The output of the selected convolutional layer of ResNet152 is taken and, after the features are compressed through the KL transformation, the feature dimensions become W × H × D. The W × H local descriptors of dimension D are then provided as input to the NetVLAD layer, with the K × D cluster-center parameters obtained from the clustering algorithm. Computing the elements of the image description matrix $V$ involves calculating the weighted residuals of the W × H local features with respect to the cluster centers, and the formula for $V(j, k)$ is as follows:
$$V(j, k) = \sum_{i=1}^{N} a_k(x_i)\left(x_i(j) - c_k(j)\right)$$
where $N = W \times H$ is the number of local descriptors, $x_i(j)$ represents the $j$-th feature value of the $i$-th local feature, and $c_k(j)$ denotes the $j$-th feature value of the $k$-th cluster center. The computation determines the distance-based weight $a_k(x_i)$ of each local feature with respect to each cluster center, constraining the weights of the local feature descriptors under each cluster to the range 0–1; a higher weight indicates that the feature is closer to that cluster center. Finally, $V$ is transformed into vector form, normalized, and output as the image's feature description vector, resulting in a VLAD vector of length K × D. This vector serves as the original image's feature descriptor.
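The following is a minimal PyTorch sketch of such a NetVLAD layer. Learnable centroids and a linear soft-assignment head are common implementation choices assumed here for illustration; this is not the exact layer used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assigns N local descriptors of dim D to K cluster centers and
    aggregates the weighted residuals into a K*D-dimensional vector."""
    def __init__(self, num_clusters: int, dim: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

    def forward(self, x):                      # x: (B, N, D) local descriptors
        a = F.softmax(self.assign(x), dim=-1)  # (B, N, K) weights a_k(x_i) in [0, 1]
        # residuals x_i - c_k between each descriptor and each centroid: (B, N, K, D)
        res = x.unsqueeze(2) - self.centroids.unsqueeze(0).unsqueeze(0)
        vlad = (a.unsqueeze(-1) * res).sum(dim=1)          # (B, K, D) weighted residuals
        vlad = F.normalize(vlad, p=2, dim=-1)              # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), p=2, dim=-1)   # final (B, K*D) VLAD vector

# Usage sketch: 196 local descriptors of dim 512, K = 64 clusters -> (2, 64 * 512).
vlad = NetVLAD(num_clusters=64, dim=512)(torch.randn(2, 196, 512))
```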
The network architecture designed in this paper takes images as input in the form of a sequence, and the network outputs a set of feature vectors for the sequence, where each element is the VLAD feature descriptor of the corresponding image. Therefore, in similarity detection, we directly compute the similarity between two sequences. If the distance between two sequences is small, loops are present in the sequence, and the higher the similarity between the sequences, the more loops may exist. Conversely, a lower similarity suggests that there are few or no loops in the sequence.
Let the input sequence at the current time be $S_c$ and the selected historical sequence at a previous time be $S_h$, both sequences having a length of $s$. The corresponding VLAD feature vectors of the two sequences are denoted as $V_c$ and $V_h$:
$$V_c = \left[ v_c^1, v_c^2, \ldots, v_c^s \right], \quad V_h = \left[ v_h^1, v_h^2, \ldots, v_h^s \right]$$
The standard Euclidean distance is used to calculate the distance between the two vectors, as follows:
$$D\left(V_c, V_h\right) = \sqrt{\sum_{i=1}^{s} \left\| v_c^i - v_h^i \right\|_2^2}$$
During the retrieval process, the current image sequence is used as the input to the first backbone network and the historical image sequence as the input to the second backbone network, and the distance between the sequences is calculated. A threshold $\alpha$ is set so that sequence pairs with distances below $\alpha$ are detected as loops. Adjusting the threshold $\alpha$ allows a balance between precision and recall.
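A short NumPy sketch of this retrieval step, assuming the two sequences' VLAD descriptors are stacked row-wise and using an illustrative threshold value:

```python
import numpy as np

def detect_loop(V_c: np.ndarray, V_h: np.ndarray, alpha: float):
    """V_c, V_h: (s, K*D) VLAD descriptors of the current and historical sequences.
    Returns the sequence distance and whether it falls below the loop threshold alpha."""
    d = np.sqrt(((V_c - V_h) ** 2).sum())   # Euclidean distance over the whole sequence
    return d, d < alpha

# Example: s = 5 images, K*D = 1024-dimensional descriptors (illustrative sizes).
rng = np.random.default_rng(0)
V_c = rng.standard_normal((5, 1024))
V_h = V_c + 0.01 * rng.standard_normal((5, 1024))   # near-duplicate historical sequence
dist, is_loop = detect_loop(V_c, V_h, alpha=10.0)
print(dist, is_loop)
```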