#### *3.1. System Overview*

An overview of the proposed framework is shown in Figure 2. First, the system acquires the original 3D point cloud from the LiDAR and encodes it into a scan context. A sub-descriptor is then extracted and inserted into a KD-Tree, an indexed tree data structure used for nearest neighbor search in large-scale, high-dimensional spaces. A fast k-Nearest Neighbor (kNN) search over the KD-Tree retrieves the nearest candidates. Next, by computing the minimum distance between the query scan context and the candidate scan contexts, we decide whether a candidate loop closure exists. If it does, our STV process is conducted: temporal verification determines whether the re-identification procedure needs to be triggered. If the temporal verification is passed, we accept the candidate as a true loop. Otherwise, we segment the original point cloud and use the resulting segment scan context to compute a new distance, which the re-identification procedure uses to judge whether a loop is found. A detailed description of these modules is given below.

#### *3.2. Segmentation*

The segmentation module consists of two submodules: ground removal and object segmentation. Scan context encodes each bin with the maximum height, so ground points carry little information and increase the similarity of different scenes in flat areas. In addition, the presence of numerous unstructured objects, such as trees, grass, and other vegetation, can mask the structured information and produce similar descriptors for different places. Meanwhile, noise generally does not persist at a fixed position over time, so it appears scattered and forms small-scale objects. We therefore use object segmentation to remove unstructured information from the environment and retain the key structured information, preventing mismatches.

**Figure 2.** The pipeline of the proposed STV-SC framework. Grey dotted box above: the two-stage fast candidate loop closure search process, including a k-Nearest Neighbor (kNN) search process and a similarity scoring process. Grey dotted box below: our STV process.

We denote one frame of the LiDAR point cloud as P = {*p*1, *p*2, ... , *pn*}. For fast cluster-based segmentation, the 3D point cloud is projected into an *Mr* × *Mc* 2D range image *R* for point cloud ordering, where

$$M\_{\rm r} = \frac{360^{\circ}}{\text{Res}\_{\rm h}} \quad , \quad M\_{\rm c} = N\_{\rm scans} . \tag{1}$$

*Resh* is the horizontal angular resolution and *Nscans* is the number of laser lines of the LiDAR. Each pixel of the range image stores the Euclidean distance from the sensor to the corresponding 3D point. We then perform a column-wise ground point evaluation on the range image as in [28], while leveraging intensity for validation.
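
For concreteness, a minimal Python/NumPy sketch of this projection is given below. The sensor parameters (64 scan lines, 0.2° horizontal resolution, and the vertical field of view) are assumed example values rather than settings from this paper, and the ground evaluation step of [28] is not reproduced.

```python
import numpy as np

def project_to_range_image(points, n_scans=64, res_h_deg=0.2,
                           fov_up_deg=2.0, fov_down_deg=-24.8):
    """Project an (N, 4) array of [x, y, z, intensity] points into an
    M_r x M_c range image following Eq. (1): M_r = 360 / Res_h azimuth
    bins, M_c = N_scans laser rings.  The vertical field of view is an
    assumed sensor parameter; adapt it to the actual LiDAR model."""
    m_r = int(round(360.0 / res_h_deg))           # azimuth bins (rows)
    m_c = n_scans                                 # laser rings (columns)
    range_img = np.zeros((m_r, m_c), dtype=np.float32)
    intensity_img = np.zeros((m_r, m_c), dtype=np.float32)

    x, y, z, intensity = points.T
    r = np.linalg.norm(points[:, :3], axis=1)     # Euclidean range per point

    yaw = np.arctan2(y, x)                                          # [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-6), -1.0, 1.0))

    # map angles to pixel coordinates
    row = ((yaw + np.pi) / (2.0 * np.pi) * m_r).astype(int) % m_r   # azimuth index
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    col = np.clip(((fov_up - pitch) / (fov_up - fov_down) * (m_c - 1)).round(),
                  0, m_c - 1).astype(int)                           # ring index

    # later points simply overwrite earlier ones that fall into the same pixel
    range_img[row, col] = r
    intensity_img[row, col] = intensity
    return range_img, intensity_img
```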

After removing the ground, we perform range image-based object segmentation to classify the point cloud into distinct clusters. The method is based on [29], with some improvements tailored to the characteristics of LiDAR. Specifically, we integrate geometry and intensity constraints for clustering. A previous study [30] showed that different objects exhibit different reflected intensities. Since intensity can be obtained directly from the LiDAR, it serves as an additional layer of validation for clustering. Whether two points *pa* and *pb* belong to the same object *Ok* is judged by the following expression, where (*a*1, *a*2) and (*b*1, *b*2) are their coordinates in the range image, respectively:

$$p\_a, \; p\_b \in O\_k$$

$$\text{s.t.} \quad ||a\_1 - b\_1|| = 1 \quad \text{or} \quad ||a\_2 - b\_2|| = 1,$$

$$\theta > \varepsilon\_g,$$

$$||I(p\_a) - I(p\_b)|| < \varepsilon\_i,$$

$$\theta = \arctan \frac{d\_2 \sin \gamma}{d\_1 - d\_2 \cos \gamma},$$

$$I(p) = \kappa(\psi(p), d). \tag{2}$$

In (2), as shown in Figure 3, *d*1 and *d*2 stand for the range values from the LiDAR to the two points (with *d*1 the longer one), *γ* is the fixed angle between the two adjacent laser beams, and *θ* is the angle between the line spanned by *pa*, *pb* and the longer of the two laser beams. *εg* and *εi* are predefined thresholds. Additionally, *ψ*(*p*) denotes the intensity of point *p* and *κ* is an intensity calibration function using distance, which can be obtained empirically.

Note that, as the first-layer judgment, the geometry constraint plays the major role. As the second layer of validation, intensity prevents objects of different types from being clustered together, i.e., under-segmentation.
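
As a rough illustration, the two-layer test in (2) could be implemented as follows (Python); the adjacency of the two points in the range image is assumed to be checked by the caller, and the calibration function *κ* is left as an identity placeholder, since the real calibration is obtained empirically on the sensor.

```python
import numpy as np

def same_object(d_a, d_b, intensity_a, intensity_b,
                gamma, eps_g, eps_i, calib=lambda psi, d: psi):
    """Geometry + intensity constraints of Eq. (2) for two points that
    are already adjacent in the range image.

    d_a, d_b      : range values of the two points
    gamma         : angle between the two adjacent laser beams (rad)
    eps_g, eps_i  : geometry / intensity thresholds
    calib         : intensity calibration kappa(psi, d); identity here
                    as a placeholder for the empirically obtained one.
    """
    d1, d2 = max(d_a, d_b), min(d_a, d_b)              # longer / shorter beam
    # arctan2 instead of arctan for robustness when the denominator is small
    theta = np.arctan2(d2 * np.sin(gamma), d1 - d2 * np.cos(gamma))
    geometry_ok = theta > eps_g
    intensity_ok = abs(calib(intensity_a, d_a) - calib(intensity_b, d_b)) < eps_i
    return geometry_ok and intensity_ok
```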

**Figure 3.** Interpretation of the geometry constraint for segmentation. (**a**): three parked cars and laser beams from sensor S. The red line represents the line spanned by two adjacent points. (**b**): geometric abstraction of (**a**). *pa* and *pb* represent two adjacent points.

Moreover, due to the fixed angle between laser beams, points near the sensor are relatively dense, while points far away are sparse. A fixed geometric threshold therefore cannot balance distant and near points: a large threshold over-segments distant objects, while a small threshold under-segments nearby ones. Thus, in the near area, a large *εg* prevents different objects from being grouped together, while in the far area a small *εg* avoids splitting the same object into multiple clusters.

To achieve more accurate segmentation at different distances, we design a dynamic adjustment strategy. The threshold is dynamically adjusted as

$$\varepsilon\_g = \varepsilon\_g^{i} - \frac{R(x, y)}{p}\, q, \tag{3}$$

where *p* denotes the step size and *q* is the decay factor; *ε<sup>i</sup><sub>g</sub>* stands for the initial value of *εg*, and *R*(*x*, *y*) is the range value at the corresponding position of the range image.
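
A direct transcription of (3) in the same sketch style; the step size *p* and decay factor *q* are tuning parameters whose concrete values are not specified here.

```python
def dynamic_eps_g(range_value, eps_g_init, p, q):
    """Dynamic geometry threshold of Eq. (3): larger for near points,
    smaller for far points."""
    return eps_g_init - (range_value / p) * q
```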

Finally, a breadth-first search based on the constraints in (2) is conducted on the range image for object clustering. The idea behind our segmentation comes from the fact that unstructured objects (mainly vegetation) are full of gaps, e.g., between leaves. When laser beams pass through these gaps, the range difference becomes large, which causes large-scale vegetation to be separated into small clusters. Noise likewise forms small objects. Therefore, we can distinguish structured from unstructured objects by cluster size. In this paper, we treat clusters with more than 30 points, or occupying more than 5 laser beams, as structured objects. As shown in Figure 4, noise, ground, and vegetation are removed, while structured parts, such as buildings and parked cars, are preserved.
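
Putting the pieces together, a condensed sketch of the breadth-first clustering and the structured-object filtering might look as follows; `same_object` and `dynamic_eps_g` are the helper sketches above, and `gamma_r`/`gamma_c` denote the (assumed) angular steps between neighboring pixels along the azimuth and ring directions.

```python
from collections import deque
import numpy as np

def segment_range_image(range_img, intensity_img, gamma_r, gamma_c,
                        eps_g_init, p, q, eps_i,
                        min_points=30, min_beams=5):
    """Breadth-first clustering on the (ground-removed) range image.
    Returns a boolean mask of pixels belonging to structured objects."""
    rows, cols = range_img.shape                   # rows: azimuth, cols: rings
    labels = np.zeros((rows, cols), dtype=int)     # 0 = unlabeled
    structured = np.zeros((rows, cols), dtype=bool)
    next_label = 1

    for r0 in range(rows):
        for c0 in range(cols):
            if labels[r0, c0] or range_img[r0, c0] <= 0:
                continue
            # grow one cluster with BFS over 4-connected neighbors
            queue, cluster = deque([(r0, c0)]), [(r0, c0)]
            labels[r0, c0] = next_label
            while queue:
                r, c = queue.popleft()
                for dr, dc, gamma in ((1, 0, gamma_r), (-1, 0, gamma_r),
                                      (0, 1, gamma_c), (0, -1, gamma_c)):
                    rn, cn = (r + dr) % rows, c + dc    # wrap in azimuth only
                    if not (0 <= cn < cols) or labels[rn, cn] or range_img[rn, cn] <= 0:
                        continue
                    eps_g = dynamic_eps_g(range_img[r, c], eps_g_init, p, q)
                    if same_object(range_img[r, c], range_img[rn, cn],
                                   intensity_img[r, c], intensity_img[rn, cn],
                                   gamma, eps_g, eps_i):
                        labels[rn, cn] = next_label
                        queue.append((rn, cn))
                        cluster.append((rn, cn))
            # keep only structured clusters: > 30 points or > 5 laser beams
            beams = len({c for _, c in cluster})
            if len(cluster) > min_points or beams > min_beams:
                for r, c in cluster:
                    structured[r, c] = True
            next_label += 1
    return structured
```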

**Figure 4.** Visualization of the segmentation process. (**a**): original point cloud of one frame, containing vegetation, a small moving object, and noise. (**b**): segmented point cloud, showing that unstructured vegetation, noise, etc., have been removed.

#### *3.3. Segment Scan Context*

Scan context [7] encodes the scene with the maximum height and represents it as a 2D image. Figure 5a is the top view of the original point cloud. Taking the LiDAR as the center, *Nr* rings are divided equidistantly in the radial direction, and *Ns* sectors are divided by equal angles in the azimuth direction. The areas where rings and sectors intersect are called bins. Each bin is uniquely represented by the maximum height of the point cloud within it. Therefore, we can project the 3D point cloud into a 2D matrix of size *Nr* × *Ns*, called the scan context. Let *Lmax* denote the maximum sensing range of the LiDAR; then the radial gap between rings is *Lmax*/*Nr* and the angular gap between sectors is 2*π*/*Ns*. By adjusting them, we can set the resolution of the scan context.

**Figure 5.** Description of scan context. (**a**): top view of a LiDAR scan, which is separated into bins by rings and sectors. (**b**): colormap of our segment scan context.

However, since scan context uses the maximum height as its only encoding, it often suffers from perceptual aliasing when facing large-scale unstructured objects; for example, trees on both sides of a road usually have similar heights. Therefore, when encountering scenes dominated by unstructured objects, we retain only the key structured information obtained via point cloud segmentation. Denoting the point cloud of a segmented LiDAR scan as P*seg*, the segment scan context *D* is expressed by

$$D = (d\_{ij}) \in \mathbb{R}^{N\_r \times N\_s} \quad , \quad d\_{ij} = \phi(\mathcal{P}\_{ij}^{\text{seg}}) . \tag{4}$$

P*seg ij* denotes the points in the bin with ring index *i* and sector index *j*, and *φ* is the function that returns the maximum height of all points in this bin. In particular, if there is no point in the bin, its value is set to zero. A visualization of our segment scan context is shown in Figure 5b. After segmentation, the descriptor exhibits discrete blocks representing different structured objects.
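
A sketch of the (segment) scan context construction of Eq. (4); the values *Nr* = 20, *Ns* = 60, and *Lmax* = 80 m are common choices in the scan context literature and are used here only as placeholders.

```python
import numpy as np

def make_scan_context(points, n_r=20, n_s=60, l_max=80.0):
    """Build an N_r x N_s (segment) scan context from an (N, 3) array of
    points, Eq. (4): each bin stores the maximum height of the points
    falling into it, or zero if the bin is empty.  Heights are assumed
    non-negative (e.g. already offset by the sensor height)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x**2 + y**2)                      # radial distance
    phi = np.arctan2(y, x) + np.pi                  # azimuth in [0, 2*pi]

    ring = np.minimum((rho / (l_max / n_r)).astype(int), n_r - 1)
    sector = np.minimum((phi / (2.0 * np.pi / n_s)).astype(int), n_s - 1)

    desc = np.zeros((n_r, n_s), dtype=np.float32)
    keep = rho < l_max                              # drop points beyond sensing range
    for i, j, h in zip(ring[keep], sector[keep], z[keep]):
        desc[i, j] = max(desc[i, j], h)             # maximum height per bin
    return desc
```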

#### *3.4. Three-Stage Search Algorithm*

After projecting the original point cloud into a scan context, the matching process computes the minimum distance between the descriptor *Dt* obtained at time *t* and the previously stored descriptors D = {*D*1, *D*2, ... , *Dt*−1}. This distance then determines whether there is a loop closure. To achieve a fast search and effectively prevent mismatches, we design a three-stage search and verification algorithm.

Stage 1: Fast k-Nearest Neighbor search. Searching the database directly with full scan contexts would involve a large number of floating-point operations and slow down the search. Instead, we perform a fast search by extracting sub-descriptors. First, the scan context is binarized as follows, where *B* denotes the matrix after binarization:

$$B(x, y) = \begin{cases} 0, & \text{if } D(x, y) = 0, \\ 1, & \text{otherwise.} \end{cases} \tag{5}$$

Then, for each row *ri* of *B*, we count the number of non-empty bins by calculating its *L*0 norm:

$$\nu(r\_i) = \|r\_i\|\_0. \tag{6}$$

Finally, we construct a one-dimensional sub-descriptor *H* = (*ν*(*r*1), *ν*(*r*2), ... , *ν*(*rn*)), which is rotation invariant. By inserting *H* into the KD-Tree, we achieve a fast kNN search that provides *k* candidates for the next stage.
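
Stage 1 can be sketched with SciPy's cKDTree as below; *k* = 50 follows Algorithm 1, and the tree is rebuilt on every query for brevity, whereas an incremental structure would be kept in practice.

```python
import numpy as np
from scipy.spatial import cKDTree

def sub_descriptor(desc):
    """Rotation-invariant sub-descriptor H of Eqs. (5)-(6):
    per-ring count of non-empty bins (L0 norm of each binarized row)."""
    return np.count_nonzero(desc != 0, axis=1).astype(np.float32)

def knn_candidates(history_sub_descs, query_sub_desc, k=50):
    """Return the indices of the k nearest previous frames in sub-descriptor space."""
    tree = cKDTree(np.asarray(history_sub_descs))      # rebuilt here for simplicity
    _, idx = tree.query(query_sub_desc, k=min(k, len(history_sub_descs)))
    return np.atleast_1d(idx)
```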

Stage 2: Similarity score with column shift. This stage uses the full scan contexts to find the nearest frame among the candidates obtained in Stage 1. Let *D<sup>q</sup>* denote the scan context of the query scan and *D<sup>c</sup>* denote one candidate scan context. A column-wise accumulation of cosine distances measures the distance between *D<sup>q</sup>* and *D<sup>c</sup>*:

$$\varphi(D^q, D^c) = \frac{1}{N\_s} \sum\_{i=1}^{N\_s} \left(1 - \frac{c\_i^q \cdot c\_i^c}{||c\_i^q|| \cdot ||c\_i^c||}\right),\tag{7}$$

where *c<sup>q</sup><sub>i</sub>* and *c<sup>c</sup><sub>i</sub>* are the *i*-th columns of *D<sup>q</sup>* and *D<sup>c</sup>*, respectively. In practice, mobile agents may revisit a place from different viewpoints. To achieve rotation invariance, we conduct a column shift process:

$$\varphi\_{\min}(D^q, D^c) = \min\_{j \in [1, N\_s]} \varphi(D^q, D\_j^c), \tag{8}$$

where *D<sup>c</sup><sub>j</sub>* denotes *D<sup>c</sup>* shifted by *j* columns and *ϕmin* is the final smallest value. If *ϕmin* is lower than the predefined threshold *εl*, we obtain a candidate *D<sup>c</sup>* for the next stage.
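
Equations (7) and (8) translate into the following sketch, which also returns the best shift *j*<sup>∗</sup> since it is reused in Stage 3; columns that are empty in either descriptor are skipped to avoid division by zero, a detail the equations leave implicit.

```python
import numpy as np

def sc_distance(d_q, d_c):
    """Column-wise cosine distance of Eq. (7) between two scan contexts."""
    num = np.sum(d_q * d_c, axis=0)
    den = np.linalg.norm(d_q, axis=0) * np.linalg.norm(d_c, axis=0)
    valid = den > 0                                   # skip empty columns
    return np.mean(1.0 - num[valid] / den[valid]) if valid.any() else 1.0

def sc_distance_with_shift(d_q, d_c):
    """Eq. (8): minimum distance over all column shifts of the candidate,
    plus the best shift j* (reused later for the segment scan contexts)."""
    n_s = d_c.shape[1]
    dists = [sc_distance(d_q, np.roll(d_c, j, axis=1)) for j in range(n_s)]
    j_star = int(np.argmin(dists))
    return dists[j_star], j_star
```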

Stage 3: Temporal verification and re-identification (STV process). To effectively prevent false positives, we design a temporal verification module for the candidate loop. Since the detection process of SLAM is continuous in time, the nodes near a true loop also have high similarity. Furthermore, true loops usually persist over consecutive frames, while outliers are sporadic. Therefore, we adopt a piecewise strategy to verify the candidate loop pair:

$$\mathcal{T}(D\_m, D\_n) = \frac{1}{N\_t} \sum\_{k=1}^{N\_t} \varphi\_{\min}(D\_{m-k}, D\_{n-k}), \tag{9}$$

where *Nt* is the number of frames involved in temporal verification. If *T* is less than a threshold *εt*, we treat the candidate as a true loop. Otherwise, we regard the frame as an ambiguous environment and trigger the re-identification module with our segment scan context. Specifically, we segment the original point clouds and calculate the distance between the segment scan contexts of the candidate loop pair. Since the shift value was obtained in the previous stage, we can directly reuse the result of Equation (8) to calculate the new distance:

$$\varphi\_{\text{seg}}(D^{\text{seg},q}, D^{\text{seg},c}) = \varphi(D^{\text{seg},q}, D^{\text{seg},c}\_{j^\*}),\tag{10}$$

where *j*<sup>∗</sup> represents the shift value at which *ϕmin*(*D<sup>q</sup>*, *D<sup>c</sup>*) is attained. Finally, if *ϕseg* is still less than a threshold *εs*, we accept the pair as an inlier; otherwise, we discard it.
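
The full STV decision (Eqs. (9) and (10)) can then be sketched as follows, reusing the distance helpers above; `scan_contexts` and `seg_scan_contexts` stand for the stored descriptors of all frames, and the values of *Nt*, *εt*, and *εs* shown are placeholders, not the thresholds used in the experiments.

```python
import numpy as np

def stv_decision(scan_contexts, seg_scan_contexts, m, n, j_star,
                 n_t=5, eps_t=0.2, eps_s=0.2):
    """Stage 3: temporal verification (Eq. 9) followed, if needed, by
    re-identification with segment scan contexts (Eq. 10).

    scan_contexts / seg_scan_contexts : lists indexed by frame number
    m, n   : indices of the candidate loop pair (query, candidate),
             both assumed to be larger than n_t
    j_star : best column shift found in Stage 2
    """
    # Eq. (9): average minimum distance over the N_t preceding frame pairs
    t_score = np.mean([sc_distance_with_shift(scan_contexts[m - k],
                                              scan_contexts[n - k])[0]
                       for k in range(1, n_t + 1)])
    if t_score < eps_t:
        return True                                    # temporally consistent loop
    # Eq. (10): re-identification with segment scan contexts at shift j*
    phi_seg = sc_distance(seg_scan_contexts[m],
                          np.roll(seg_scan_contexts[n], j_star, axis=1))
    return phi_seg < eps_s
```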

Algorithm 1 describes our search process in detail, where *num_diff* denotes the minimum index interval between two frames that can form a loop closure and *min_dis* denotes the minimum distance.

```
Algorithm 1 Three-stage search process
Require: Original point cloud P of current frame at time t.
Require: Scan context Dq of current frame at time t.
Require: Sub-descriptors of the previous frames in KD-Tree.
Require: Previous scan contexts D stored before t.
 1: k ← 50, q ← index of current frame.
 2: num_diff ← 50, min_dis ← 100,000.
 3: Build the sub-descriptor H of the current frame (Equations (5) and (6)) and insert it into KD-Tree.
 4: if q > k then
 5:   Find k nearest candidates in KD-Tree (kNN search).
 6:   for i = 1 to k do
 7:     ii ← frame index of ith candidate.
 8:     if q − ii > num_diff then
 9:       Calculate the distance ϕ between frame q and ii (Equations (7) and (8)).
10:       if ϕ < min_dis then
11:         min_dis ← ϕ, Dc ← Dii.
12:       end if
13:     end if
14:   end for
15:   if min_dis < εl then
16:     Temporal verification of Dq and Dc (Equation (9)).
17:     if T < εt then
18:       Loop found!
19:     else
20:       Segment P to get Pseg (Equation (2)).
21:       Construct segment scan context Dseg (Equation (4)).
22:       Calculate the distance ϕseg between Dseg_q and Dseg_c (Equation (10)).
23:       if ϕseg < εs then
24:         Loop found!
25:       end if
26:     end if
27:   end if
28: end if
```