*2.1. Semantic Topology Graph*

The construction of the semantic topology graphs is the basis of loop closure detection, so this section first introduces the construction process of the semantic topology graphs (see Figure 2).

**Figure 2.** Flow chart for constructing semantic topology graph.

For each obtained image, semantic segmentation is performed to extract landmarks. After preprocessing, the image is divided into landmark regions and contours. The obtained landmarks are selected and sent to AlexNet to extract third convolutional layer (Conv3) features. Then, Gaussian random projection is used to reduce the dimensionality of the feature vectors, and low-dimensional feature vectors are output. In addition, the Hu moments are calculated from the obtained contours. At the same time, the preprocessed semantic segmentation images are used to extract the center of each landmark region as a node, and the landmarks seen from the same viewpoint are connected by undirected edges to establish a semantic topology graph according to the co-visibility information.

Finally, a random walk descriptor is exploited to describe the topological graph structure. Here, the images obtained before the loop closure detection are called dataset images.

#### 2.1.1. Landmark Extraction

Previous methods [30,31] employed object proposals [42] to extract landmark regions, even though these contain a lot of irrelevant feature information. In contrast, the proposed method adopts semantic segmentation to extract landmarks, which can accurately delimit the landmark regions.

DeepLabV3+ [43] is one of the most influential semantic segmentation models. It outperforms FCN [44], U-Net [45], and SegNet [46] on several datasets and is widely used in engineering applications. The ADE20K dataset [47,48] covers a wide range of scenes and object categories and provides dense annotations, so it was used to train DeepLabV3+. The pre-trained DeepLabV3+ model can then be applied to extract landmarks.

Here, DeepLabV3+ was used to fuse the shallow features output by the encoder with the deep features generated by the ASPP module so that it could produce high-precision semantic segmentation results.
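As an illustration of this step, the sketch below runs a pre-trained segmentation model on one frame. The checkpoint name `deeplabv3plus_ade20k.pth`, the assumption that the whole module was serialized, and the 150-channel output layout are all illustrative choices; the paper does not release its trained model.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Hypothetical DeepLabV3+ checkpoint fine-tuned on ADE20K (150 classes);
# assumes the full module (not just a state_dict) was serialized.
model = torch.load("deeplabv3plus_ade20k.pth", map_location="cpu")
model.eval()

preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def segment(image_path: str) -> torch.Tensor:
    """Return an (H, W) tensor of ADE20K class indices for one image."""
    img = Image.open(image_path).convert("RGB")
    x = preprocess(img).unsqueeze(0)        # (1, 3, H, W)
    with torch.no_grad():
        logits = model(x)                   # (1, 150, H, W) assumed
    return logits.argmax(dim=1).squeeze(0)  # per-pixel class labels

label_map = segment("frame_000123.png")
```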

#### 2.1.2. Landmark Selection


Due to illumination effects and dynamic disturbances in the images obtained by the robot, as well as the inherent defects of the semantic segmentation model, the semantic segmentation image produced by the model discussed in Section 2.1.1 contained substantial noise, along with dynamic and secondary landmarks. To overcome these problems, the landmarks were preprocessed to obtain significant landmark regions. Then, the dynamic regions were removed from the preprocessed landmarks, and the distinctive patches were selected.

As shown in Figure 3, the semantic segmentation image (see Figure 3b) was filtered to remove regions whose area was less than a specified threshold (100 pixels in this paper). Figure 3c was obtained by merging each filtered-out region with its surrounding area. Through the above procedures, the secondary landmarks and holes were filtered out to obtain obvious landmark regions with clear boundaries.

**Figure 3.** Selection of the landmarks: (**a**) raw image, (**b**) semantic segmentation image, (**c**) the result of filtering, and (**d**) the result of eliminating pedestrian dynamic landmarks.
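A minimal sketch of the filtering step above, assuming the label map from the previous sketch: connected components smaller than the 100-pixel threshold are relabeled with the nearest kept pixel's label, which is one plausible reading of "merging with the surrounding area".

```python
import numpy as np
from scipy import ndimage

def filter_small_regions(label_map: np.ndarray, min_area: int = 100) -> np.ndarray:
    """Drop connected components smaller than min_area, then merge each
    dropped pixel into the surrounding area via its nearest kept pixel."""
    removed = np.zeros(label_map.shape, dtype=bool)
    for cls in np.unique(label_map):
        comp, n_comp = ndimage.label(label_map == cls)
        sizes = np.bincount(comp.ravel())        # sizes[0] is the background
        for comp_id in range(1, n_comp + 1):
            if sizes[comp_id] < min_area:
                removed |= comp == comp_id
    if not removed.any():
        return label_map.copy()
    # For every removed pixel, look up the label of the nearest kept pixel.
    idx = ndimage.distance_transform_edt(removed, return_distances=False,
                                         return_indices=True)
    return label_map[tuple(idx)]
```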


In order to overcome the impact of dynamic scenes, the semantic information of landmarks was then used to eliminate pedestrian dynamic landmarks. At the same time, pedestrian regions were merged with the long-term parked car regions, and the merged area could be used as car landmarks in subsequent work. Figure 3d was obtained by the above-mentioned operations. After being excluded, dynamic landmarks no longer took part in matching during the follow-up loop closure detection. Furthermore, the number of pixels could be calculated for each landmark region. The distinctive landmarks were selected for loop closure detection according to the number of landmark pixels and semantic information, combined with the experimental scenes. In addition, dynamic landmarks were determined by scene content and landmark semantic information; in other words, according to the movement status of the landmarks in each experimental scene, we removed the moving landmarks from the dataset images and prevented them from participating in subsequent experiments. Formally, we denote *t* as the number of distinctive landmarks selected in an image (*t* was 5 or 10 in this work).
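The selection logic might look like the following sketch. The set of dynamic class indices is scene-dependent, and the ADE20K index used for "person" here is illustrative, not taken from the paper.

```python
import numpy as np

# Classes treated as dynamic depend on the experimental scene; the
# index for "person" (12) is an illustrative assumption.
DYNAMIC_CLASSES = {12}

def select_landmarks(label_map: np.ndarray, t: int = 5) -> list[int]:
    """Drop dynamic classes, then keep the t largest landmarks by pixel count."""
    counts = np.bincount(label_map.ravel())
    candidates = [(cls, counts[cls]) for cls in np.nonzero(counts)[0]
                  if cls not in DYNAMIC_CLASSES]
    candidates.sort(key=lambda p: p[1], reverse=True)
    return [cls for cls, _ in candidates[:t]]
```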

#### 2.1.3. CNN Features

CNN features are robust to appearance change, so they far surpass hand-crafted features in image retrieval and classification. AlexNet [49] won the 2012 ImageNet competition, and the network used here was pre-trained for object recognition on the ILSVRC dataset [50].

The AlexNet architecture had 8 layers, of which the first 5 were convolutional layers and the last 3 were fully connected layers. There was a pooling layer after the 1st, 2nd, and 5th convolutional layers, but none after the 3rd and 4th. Each convolutional layer was followed by a ReLU activation, and local response normalization was applied after the first two convolutional layers. The input of the network was a 227 × 227 3-channel image, and the output feature of the third convolutional layer had 13 × 13 × 384 = 64,896 values. According to the research of [27], the output features of the third convolutional layer of AlexNet perform best under appearance changes. We found that the output features of the fully connected layers carried strong semantic information that was robust to viewpoint changes but performed poorly under appearance changes. At the same time, it has been shown that AlexNet pre-trained on an object recognition task outperforms CNN models trained for place recognition when describing the entire image under viewpoint changes. Other advanced networks such as VGG, ResNet, and DenseNet have more complex architectures and have seen little research and use in the field of loop closure detection. Therefore, this article used the relatively lightweight and mature AlexNet to extract CNN features. Based on the above research, the proposed method employed the output of Conv3 of AlexNet as the global feature of the landmark region.

The landmark proposals extracted by the object proposal method contained a large amount of irrelevant feature information, which introduced noise into the CNN feature description. In contrast, the landmark regions extracted by semantic segmentation in this paper contained only the landmark features and no unrelated ones. Figure 4a–c was introduced in Section 2.1.2; this section explains landmark region and contour extraction. The landmarks were selected from the filtering result (see Figure 4c) to get Figure 4d, and the binary contour image (Figure 4e) of each corresponding landmark was obtained with a Canny operator. Then, the landmarks (see Figure 4d) were resized to 227 × 227 pixels and input to the pre-trained AlexNet to extract features. As a result, the features of each landmark could be represented by a 64,896-dimensional vector. In order to keep the original size information of the landmark, this paper added the Hu moments of the contour (see Figure 4e) to the CNN feature to describe the landmark.
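A sketch of the Conv3 feature extraction, using torchvision's ImageNet-pretrained AlexNet truncated after the third convolutional layer; the exact input normalization is an assumption, and the Hu moments appended to this vector are computed as in Section 2.2.2.

```python
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained AlexNet, truncated after the ReLU of Conv3
# (features[0..6] in torchvision's layer layout).
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
conv3 = alexnet.features[:7].eval()

to_tensor = T.Compose([
    T.ToPILImage(),
    T.Resize((227, 227)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def conv3_feature(landmark_bgr: np.ndarray) -> np.ndarray:
    """13 x 13 x 384 = 64,896-dimensional Conv3 feature of one landmark crop."""
    x = to_tensor(cv2.cvtColor(landmark_bgr, cv2.COLOR_BGR2RGB)).unsqueeze(0)
    with torch.no_grad():
        return conv3(x).flatten().numpy()
```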

The obtained high-dimensional vector contained redundant landmark feature information, and a large amount of computational cost was required to calculate landmark similarity. Due to the real-time requirements of visual SLAM, the Gaussian random projection method [51] utilized in [30] was employed to reduce the dimensionality of the feature vector to 2048 dimensions.
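A minimal sketch of the projection step with scikit-learn's `GaussianRandomProjection`; the paper follows [30,51], so the fixed random seed and fitting style here are our assumptions.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Project the 64,896-dim Conv3 vectors down to 2048 dims. The same fitted
# projection must be reused for every landmark so distances stay comparable.
rng_proj = GaussianRandomProjection(n_components=2048, random_state=0)
features = np.random.rand(10, 64_896)       # stand-in landmark features
low_dim = rng_proj.fit_transform(features)  # shape (10, 2048)
```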


**Figure 4.** Extraction of landmark regions and contours: (**a**) raw image, (**b**) semantic segmentation image, (**c**) the result of filtering, (**d**) the selected landmark regions, and (**e**) the contours of the selected landmarks.

#### 2.1.4. Graph Representation


In order to preserve the spatial relationship of the scene, a semantic topology graph was constructed from a single image obtained by the robot. When the robot was initialized, the camera captured the first image and started to create the semantic topology graph. In this paper, each landmark was abstracted as a node containing category and pixel number information. Additionally, the node was located at the center of the landmark region.

The truncated random walk proposed by Perozzi et al. [52] was used to describe the semantic topological graph and represented each node as a fixed-length embedding vector. In order to enrich the feature expression, the node index and the number of pixels were adopted to describe the nodes. The node index was obtained according to the 150 semantic categories of the ADE20K dataset, and the number of pixels was acquired from the landmark region.

Then, the random walk descriptor of each node was calculated, with each node used as the target node. In the semantic topological graph, the next adjacent node was randomly selected until the walking depth *n* was reached and a random walk path was obtained. In this paper, *m* random walks were performed on each node to obtain *m* random walk paths. Finally, each target node could be expressed as a matrix $M = \{m_{ij}\} \in \mathbb{R}^{m \times n}$. Since each node contained information about both the index and the number of pixels, a matrix $M = \{m_{ij}\} \in \mathbb{R}^{m \times 2n}$ was finally obtained. In addition, the random walks followed certain rules, i.e., they would not repeat the same path and would not return to previous nodes during each walk. The selection of *m* and *n* was related to the number of nodes in the graph structure. Based on the research of Gawel et al. [36] and the number of landmarks selected in Section 2.1.2, *m* was selected as 10, 20, or 50, and *n* was 3 or 5 for the experiments.
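The walk procedure described above might be sketched as follows. The zero padding for walks that dead-end before depth *n* and the bounded re-sampling of duplicate paths are our implementation choices, not specified in the paper.

```python
import random

def random_walk_descriptor(graph, node_info, target, m=10, n=3):
    """m x 2n descriptor matrix for one target node.

    graph     : node id -> list of adjacent node ids (undirected edges)
    node_info : node id -> (ADE20K class index, pixel count)
    Walks never revisit a node; duplicate paths are re-sampled up to a
    bounded number of attempts, so small graphs may yield fewer rows.
    """
    rows, seen, attempts = [], set(), 0
    while len(rows) < m and attempts < 100 * m:
        attempts += 1
        path = [target]
        while len(path) < n:
            choices = [v for v in graph[path[-1]] if v not in path]
            if not choices:
                break                       # dead end before reaching depth n
            path.append(random.choice(choices))
        if tuple(path) in seen:
            continue                        # reject a repeated path
        seen.add(tuple(path))
        row = [x for v in path for x in node_info[v]]  # (index, pixels) pairs
        rows.append(row + [0] * (2 * n - len(row)))    # zero-pad short walks
    return rows
```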

Figure 5 shows an example of constructing a descriptor. After the image in Figure 5a was obtained by the robot, the landmark nodes were extracted to obtain Figure 5b. Then, the nodes in Figure 5b were connected by undirected edges according to the co-visibility information to get Figure 5c. The semantic topology graph (see Figure 5d) was used to describe the geometric connection of the nodes in Figure 5a. Furthermore, a random walk graph descriptor (see Figure 5e) was constructed according to the semantic topology graph (see Figure 5d). The blue node (car) was used here to construct a descriptor for the target node. For the purpose of illustration, a graph description matrix $M \in \mathbb{R}^{5 \times 6}$ was made to represent the geometric characteristics of the image by five random walks with a depth of 3 each time. The last row of Figure 5e corresponds to the random walk path shown by the black arrow in Figure 5d. Figure 5f used a matrix to quantify the description of Figure 5e, where the red box marks the index of a node and its number of pixels. The index of the target node car was 21, and 68,192 was the number of pixels contained in the car landmark. In the same way, 2 and 112,918 were the index and pixel number of the building node; 3 and 13,488 those of the sky node; 5 and 10,697 those of the tree node; and 7 and 101,572 those of the road node. For visualization, some node information was omitted.


**Figure 5.** Construction of the topology graph and extraction of the descriptor: (**a**) raw image, (**b**) creation of nodes, (**c**) connection of undirected edges, (**d**) construction of topology graph, (**e**) random walk descriptor, and (**f**) descriptor matrix.

*2.2. Loop Closure Detection*

This section introduces the algorithm of loop closure detection, and its flowchart is shown in Figure 6. Firstly, dataset images (candidate images) that matched the current image (query image) were retrieved. Then, the appearance similarity was calculated according to the CNN and contour features between images. In addition, geometric similarity was obtained by using the random walk descriptor. Finally, loop closure detection was performed according to the overall similarity.

#### 2.2.1. Obtain Candidate Images

When the robot enters a previously visited environment again, it needs to retrieve candidate images from the dataset images. In this article, candidate images of the current query image were obtained by controlling the number of landmarks that shared the same label between the query image and each image in the dataset. The smaller the required number of shared nodes, the more candidate images and the longer the retrieval time, and vice versa. A reasonable setting of the number of shared nodes can improve the speed and accuracy of loop closure detection, so the number of shared nodes was set to 1 to obtain candidate images. In other words, when the query image and a dataset image both have a landmark with the same label, the dataset image is considered a candidate image. By the same principle, each query image yields a candidate image set, and each image in the candidate image set has at least one landmark (node) with the same label as the query image.
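A minimal sketch of this retrieval rule, representing each image by the set of its landmark labels:

```python
def candidate_images(query_labels: set[int],
                     dataset_labels: list[set[int]],
                     min_shared: int = 1) -> list[int]:
    """Return indices of dataset images sharing at least min_shared
    landmark labels with the query image (min_shared = 1 in the paper)."""
    return [i for i, labels in enumerate(dataset_labels)
            if len(query_labels & labels) >= min_shared]
```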

**Figure 6.** Flow chart of loop closure detection.

#### 2.2.2. Appearance Similarity

To calculate the appearance similarity between the candidate image and the current query image, it is necessary to match the landmarks of the query image with all landmarks of the candidate image. By using the semantic information of landmarks, we employed the nearest neighbor search based on the cosine distance (see Equation (1)) of CNN features to match the landmark pairs of the same label in the two images so that only the landmarks of the same category were matched to speed up the matching process. In the matching process, we used a bidirectional matching method, i.e., landmark pairs were accepted only if they were mutual matches.

$$d\_{ij}^{\text{cosine}} = \frac{1}{2}\left(1 - \frac{v\_i^q \cdot v\_j^c}{\|v\_i^q\|\_2 \, \|v\_j^c\|\_2}\right) \tag{1}$$


where $v_i^q$ denotes the feature vector of the *i*-th landmark of the query image and $v_j^c$ describes the feature vector of the *j*-th landmark of the candidate image.
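A sketch of the label-filtered, bidirectional nearest-neighbor matching built on Equation (1); the dictionary-based data layout is an assumption for illustration.

```python
import numpy as np

def cosine_distance(vq: np.ndarray, vc: np.ndarray) -> float:
    """Equation (1): cosine distance scaled into [0, 1]."""
    return 0.5 * (1.0 - vq @ vc / (np.linalg.norm(vq) * np.linalg.norm(vc)))

def mutual_matches(query_feats: dict[int, np.ndarray],
                   cand_feats: dict[int, np.ndarray]) -> list[tuple[int, int]]:
    """Bidirectional nearest-neighbor matching between landmarks that share
    a label (the label filtering is assumed to happen before this call)."""
    def nearest(src: dict, dst: dict) -> dict:
        return {i: min(dst, key=lambda j: cosine_distance(src[i], dst[j]))
                for i in src}

    q2c = nearest(query_feats, cand_feats)
    c2q = nearest(cand_feats, query_feats)
    # Keep a pair only when each landmark is the other's nearest neighbor.
    return [(i, j) for i, j in q2c.items() if c2q[j] == i]
```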

While calculating the similarity of the CNN features, the geometric shape of the landmark was introduced as a penalty factor to eliminate false positives, i.e., cases where the CNN features were similar but the contours were different. References [30,31,53] used the difference between the long side and the wide side of the region proposal of a landmark pair to measure the shape difference. However, because they only used the long-side and wide-side difference of the bounding box, the influences of scale and rotation were omitted. When the viewpoint changed drastically, the rotation of the landmark caused a large change in the aspect ratio of the bounding box. In the end, the shape penalty factor was too large, resulting in a low appearance similarity.

In order to solve the above problems, Hu moments [54] were used to describe the irregular contour features of landmarks; they are invariant to rotation, translation, and scale. Because Hu moments span a wide range of values, the logarithm was used for data compression in order to facilitate comparison. At the same time, considering that a Hu moment may be negative, the absolute value was taken before the logarithm, as shown in Equation (2):

$$\mathbf{c}\_i = \text{sign}(\text{hu}\_i)\,\log\lvert\text{hu}\_i\rvert, \quad i = 1, 2, \dots, 7 \tag{2}$$

where *sign*(*x*) is the sign function.

Hu [54] constructed seven invariant moments to describe geometric shape. Therefore, each landmark contour could be expressed as a feature vector by seven Hu moment values through Equation (3):

$$\mathbf{C} = (\mathbf{c}\_1, \mathbf{c}\_2, \mathbf{c}\_3, \mathbf{c}\_4, \mathbf{c}\_5, \mathbf{c}\_6, \mathbf{c}\_7) \tag{3}$$
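A sketch of Equations (2) and (3) with OpenCV's Hu moments; the epsilon guarding log(0) is our addition, since the paper does not discuss zero-valued moments.

```python
import cv2
import numpy as np

def contour_descriptor(contour: np.ndarray) -> np.ndarray:
    """Equations (2)-(3): seven log-compressed Hu moments of one contour."""
    hu = cv2.HuMoments(cv2.moments(contour)).flatten()
    eps = 1e-30  # guards log(0); not discussed in the paper
    return np.sign(hu) * np.log(np.abs(hu) + eps)
```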

Through the contour feature vector, the shape difference of landmark contour between query image and candidate image could be calculated by Equation (4):

$$\gamma\_{ij} = \exp\left(\max\_{m=1\dots 7} \frac{\left|\mathbf{c}\_m^{qi} - \mathbf{c}\_m^{cj}\right|}{\left|\mathbf{c}\_m^{qi}\right|}\right) \tag{4}$$

where $c_m^{qi}$ and $c_m^{cj}$ denote the *m*-th Hu moment of the *i*-th landmark contour in the query image and the *m*-th Hu moment of the *j*-th landmark contour in the candidate image, respectively.

According to cosine distance of the CNN feature and shape similarity obtained by the above calculation, the appearance similarity of the landmark pair between the query and the candidate images could be obtained by Equation (5):

$$d\_{\rm ij} = 1 - d\_{\rm ij}^{\rm cosine} \cdot \gamma\_{\rm ij} \tag{5}$$

In Equation (5), when the contour shape of the landmark is close, γ*ij* is close to 1. If γ*ij* is larger, it indicates that the contour difference of the landmark is large. In addition, when *dij* is close to 1, it means that the landmarks both have similar CNN features and geometric shapes. Furthermore, when *dij* is a negative number, it indicates that the geometric shapes of the landmarks differ greatly. If *dij* is small, it reveals that there may be differences in the CNN features or geometric shapes.
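Equations (4) and (5) translate directly into a few lines; this assumes the compressed Hu vectors from Equations (2) and (3) and the cosine distance from Equation (1). The small epsilon in the denominator is our guard against a zero-valued moment.

```python
import numpy as np

def shape_penalty(c_q: np.ndarray, c_c: np.ndarray, eps: float = 1e-12) -> float:
    """Equation (4): exponential of the worst relative Hu-moment difference."""
    return float(np.exp(np.max(np.abs(c_q - c_c) / (np.abs(c_q) + eps))))

def appearance_similarity(d_cos: float, gamma: float) -> float:
    """Equation (5): penalized appearance similarity of one landmark pair."""
    return 1.0 - d_cos * gamma
```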

#### 2.2.3. Geometric Similarity

In visual SLAM, the accuracy of loop closure detection is particularly important. Therefore, it is necessary to consider both appearance similarity and geometric similarity during loop closure detection. Thus, the random walk graph descriptor proposed in Section 2.1.4 was used to calculate geometric similarity for graph matching. Denote the vectorized form of the descriptor matrix $M = \{m_{ij}\} \in \mathbb{R}^{m \times 2n}$ by $G \in \mathbb{R}^{2mn}$, a concatenation of the columns of $M$ into a vector.

In Section 2.1.4, we obtained the random walk descriptors of the dataset images, so only the semantic topology graph of the query image needs to be constructed to extract its descriptor. Since the number of pixels was much larger in value than the node index, the absolute size of the descriptor feature vector varied greatly. Therefore, it was more appropriate to use cosine similarity to express the relative difference between graph descriptors, as in Equation (6):

$$S\_g\left(G\_q, G\_c\right) = \frac{G\_q \cdot G\_c}{\|G\_q\|\_2 \cdot \|G\_c\|\_2} \tag{6}$$

where $G_q$ and $G_c$ denote the feature vectors of the random walk descriptors in the query image and the candidate image, respectively. The denominator is the product of the corresponding vector norms. After getting a similarity score, it needed to be normalized with Equation (7):

$$S\_g = \frac{1}{2} + \frac{1}{2} S\_g\left(G\_q, G\_c\right) \tag{7}$$

Through Equation (7), the similarity score in the range of [0, 1] could be obtained.
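Equations (6) and (7) fit in one small function, assuming both descriptors were vectorized with the same *m* and *n*:

```python
import numpy as np

def geometric_similarity(G_q: np.ndarray, G_c: np.ndarray) -> float:
    """Equations (6)-(7): cosine similarity of the vectorized random walk
    descriptors, rescaled from [-1, 1] to [0, 1]."""
    s = G_q @ G_c / (np.linalg.norm(G_q) * np.linalg.norm(G_c))
    return 0.5 + 0.5 * s
```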

#### 2.2.4. Overall Similarity

This section discusses the calculation of the overall similarity between the query image and the candidate image. We not only considered the appearance characteristics of the image but also added geometric constraints. Sections 2.2.2 and 2.2.3 obtained the appearance and geometric similarities of a single landmark pair. Through Equations (8) and (9), we scored each best-matched landmark pair $\left(I_q^i, I_c^j\right)$ between the query image $I_q$ and the candidate image $I_c$. The similarity score between each landmark *i* in the query image and the most similar landmark *j* selected by the nearest neighbor search in the candidate image was first computed, and then the scores were assigned to the candidate image as the mean of the individual scores of its landmarks. Finally, through Equation (10), the overall similarity score of each candidate image was obtained.

$$\hat{S}\_{q,c} = \frac{1}{t} \sum\_{i,j} d\_{ij} \tag{8}$$

$$\hat{S}\_g = \frac{1}{t} \sum\_{i,j} S\_g \tag{9}$$

where *t* denotes the number of landmarks in the candidate image $I_c$ (including unmatched landmarks), *i* represents the *i*-th landmark of the current query image, and *j* is the most similar landmark selected by the nearest neighbor search in the candidate image. Moreover, the sum is taken only over the best-matched landmark pairs selected by the nearest neighbor search.

$$S\_{\text{all}} = \hat{S}\_g \, \hat{S}\_{q,c} \tag{10}$$

In Equation (10), the geometric similarity $\hat{S}_g$ is used as a penalty factor on the appearance similarity score to filter out candidate images with similar local features but large differences in geometric information. In the experiments, it was normalized to [0, 1]. We normalized the set of overall similarity scores between the current query image and all candidate images (see Section 2.2.1). Let the overall similarity score set of the current query image be $X = \{x_1, x_2, \dots, x_m\}$, where *m* denotes the number of candidate images retrieved for the current query image and $x_i$ ($1 \le i \le m$) denotes the overall similarity score between the current query image and the *i*-th candidate image. Through Equation (11), each value in *X* can be normalized to [0, 1]:

$$y\_i = \frac{x\_i - \min(X)}{\max(X) - \min(X)} \tag{11}$$

where $Y = \{y_1, y_2, \dots, y_m\}$ is the normalized score set and $y_i$ is one of the score values.
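A sketch tying Equations (8)-(11) together; the per-pair score lists are assumed to come from the matching in Section 2.2.2 and the descriptor comparison in Section 2.2.3.

```python
import numpy as np

def overall_score(pair_appearance: list[float], pair_geometric: list[float],
                  t: int) -> float:
    """Equations (8)-(10): average the per-pair scores over the t landmarks
    of the candidate image, then use geometric similarity as a penalty
    factor on appearance similarity."""
    s_qc = sum(pair_appearance) / t   # Equation (8)
    s_g = sum(pair_geometric) / t     # Equation (9)
    return s_g * s_qc                 # Equation (10)

def normalize_scores(scores: list[float]) -> np.ndarray:
    """Equation (11): min-max normalization of the candidate score set."""
    x = np.asarray(scores, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```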

After obtaining the normalized similarity scores for loop closure detection, it is often necessary to perform temporal and spatial consistency verification. In this article, no such geometric check was added. Nevertheless, the proposed method, which integrates visual, spatial, and semantic information, was still found to improve performance.
