**1. Introduction**

As the first step towards the realization of autonomous intelligent systems, simultaneous localization and mapping (SLAM) has attracted much interest and made astonishing progress over the past 30 years [1]. Place recognition, or loop closure detection, gives SLAM the ability to identify previously observed places, which is critical for back-end pose graph optimization to eliminate accumulated errors and construct globally consistent maps [2,3]. Benefiting from the popularity of cameras and the development of computer vision, vision-based place recognition has been widely studied. However, cameras inevitably struggle to cope with illumination variance, poor light conditions, and viewpoint change [4]. Compared with cameras, LiDAR is robust to such perceptual variance and provides stable loop closures; thus, LiDAR-based recognition has drawn more attention recently. LiDAR-based place recognition is achieved by encoding descriptors directly from geometric information or from segmented objects, and similarity is then assessed by the distance between descriptors. Methods such as multi-view 2D projection (M2DP) [5], bag of words (BOW) [6], scan context (SC) [7], PointNetVLAD [8], and OverlapTransformer [9] extract descriptors from local or global geometric information (3D point clouds), whereas Segmatch [10], semantic-graph-based place recognition [11], semantic scan context (SSC) [12], and RINet [13] leverage segmented objects to define descriptors.

**Citation:** Tian, X.; Yi, P.; Zhang, F.; Lei, J.; Hong, Y. STV-SC: Segmentation and Temporal Verification Enhanced Scan Context for Place Recognition in Unstructured Environment. *Sensors* **2022**, *22*, 8604. https://doi.org/10.3390/s22228604

Academic Editor: Gregor Klancar

Received: 12 October 2022; Accepted: 3 November 2022; Published: 8 November 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In this paper, we define relatively large regular objects as structured objects (buildings, ground, trunks, etc.) and others as unstructured objects (vegetation, small moving objects, noise points, etc.). In practice, vegetation is the most likely to appear at a large scale and obscure structured information; thus, we mainly consider unstructured scenes dominated by vegetation. One key issue faced by the above methods is that outliers occur when two places show similar features due to large-scale vegetation. As shown in Figure 1, large-scale tree leaves significantly increase the similarity of different places and reduce the influence of other critical objects in the scene, resulting in similar descriptors for different places. This type of unstructured environment often causes perceptual aliasing and limits the recall rate. Ultimately, the SLAM result is severely distorted, and the mobile agent cannot perceive the environment correctly. Therefore, designing a place recognition algorithm that is robust in environments dominated by unstructured objects is of great importance for enhancing the environmental adaptability of autonomous intelligent systems (such as self-driving vehicles and mobile robots) and promoting the development of autonomous driving, field surveying, etc.

**Figure 1.** Example of a false positive detected by Scan context and flagged by our temporal verification module. Top figures: frames 4058 and 4180 of KITTI sequence 00. The vegetation on the right side makes them difficult to distinguish. Since the ground-truth distance between these two frames is 148.64 m, they should not be considered a loop closure. Middle figures: colormaps of the scan contexts before segmentation. Bottom figures: segment scan contexts of the corresponding frames, represented as colormaps. The left side of each colormap shows the preserved buildings, and the empty right side indicates that the vegetation has been removed. After segmentation, these two frames become distinguishable. If we directly use Scan context, the distance between them is 0.1488, resulting in a false positive. Our segment scan context yields a distance of 0.327, thus avoiding the outlier.

In [14], segmentation was first proposed to deal with certain conditions, such as forests, and demonstrated its potential for removing non-critical information. Inspired by this, we intend to enhance scan context with segmentation to make it suitable for unstructured environments. At the same time, considering the temporal continuity of SLAM and the occasional nature of outliers, we adopt a piecewise strategy: temporal verification is applied to candidate loops to decide whether to trigger the re-identification module, thus reducing the time consumption of the whole system.

In this paper, we present the segmentation and temporal verification enhanced scan context (STV-SC). We first design a range-image-based segmentation method and explain why segmented point clouds can differentiate between structured and unstructured objects. Then, a three-stage search process is proposed to effectively avoid false positives. The STV process checks temporal consistency to determine whether to trigger the re-identification module. If triggered, we segment the point clouds of the matching frames and remove unstructured objects. Finally, outliers are filtered out by the similarity score recomputed from the segmented descriptors.
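As a concrete sketch, the three-stage decision flow can be outlined as follows. This is an illustrative Python outline, not the paper's implementation: the Euclidean descriptor distance, the threshold, the window size, and the helper names (`descriptor_distance`, `stv_sc_loop_check`) are placeholder choices for exposition.

```python
import numpy as np

def descriptor_distance(d1, d2):
    """Placeholder similarity measure (Euclidean); the paper's method
    instead uses distances between (segment) scan contexts."""
    return float(np.linalg.norm(d1 - d2))

def stv_sc_loop_check(query_desc, query_seg_desc, cand_desc, cand_seg_desc,
                      recent_matches, cand_id, accept_th=0.2, window=3):
    # Stage 1: score the retrieved candidate with the raw descriptor.
    if descriptor_distance(query_desc, cand_desc) > accept_th:
        return False
    # Stage 2: temporal verification -- loop closures in SLAM arrive
    # continuously, so recently matched frame indices should stay close
    # to the current candidate; an isolated jump suggests aliasing.
    if all(abs(cand_id - m) <= window for m in recent_matches[-window:]):
        return True
    # Stage 3: re-identification -- re-score with descriptors computed
    # from segmented point clouds (unstructured objects removed).
    return descriptor_distance(query_seg_desc, cand_seg_desc) <= accept_th
```

Because stage 3 runs only when the temporal check fails, the costly segmentation step is skipped for the majority of (temporally consistent) candidate loops.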

The main contributions of this paper are as follows:


This paper is structured as follows. Section 2 reviews the related literature on place recognition in both vision-based and LiDAR-based settings. Section 3 introduces the proposed 3D point cloud segmentation algorithm, followed by the segment scan context and the three-stage search algorithm. The experimental tests and their discussion are described in Section 4. Finally, conclusions are drawn in Section 5.

#### **2. Related Works**

Depending on the sensing devices used, place recognition methods can be grouped into vision-based and LiDAR-based. Visual place recognition has been well researched and has made significant advances in the past. Generally, visual approaches represent scene features by extracting multiple descriptors, such as Oriented FAST and Rotated BRIEF (ORB) [16] and the Scale-Invariant Feature Transform (SIFT) [17], to construct a dictionary, and then leverage the bag of words (BOW) [6] model to measure the distance between words belonging to different frames. Recently, learning-based approaches have been used for loop detection [18,19]. NetVLAD [18] designed a new generalized VLAD layer and implemented it in a CNN to achieve end-to-end place recognition. DOOR-SLAM [20] has verified this method in a real-world SLAM system. However, image representations usually lead to performance degradation in scenes with illumination and viewpoint changes. To overcome such issues, researchers have developed robust visual place recognition methods [21–23] to handle changing light and seasons. In spite of this, these methods can only handle certain scenes.
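To make the BOW matching step concrete, here is a minimal numpy sketch under simplifying assumptions: the local descriptors and the visual vocabulary are synthetic float arrays, whereas a real pipeline would quantize binary ORB or SIFT descriptors against a vocabulary trained offline.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word and
    return the normalized word-frequency histogram of the frame."""
    # Pairwise distances between descriptors and words: (n_desc, n_words).
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / (hist.sum() + 1e-12)

def bow_similarity(h1, h2):
    """Cosine similarity between two BOW histograms; frames whose
    score exceeds a threshold are treated as loop candidates."""
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))
```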

Unlike a camera, LiDAR is robust to the environmental changes stated above, while being rotation-invariant. Nevertheless, LiDAR-based recognition remains an advanced and challenging problem for laser SLAM systems. LiDAR methods can be further categorized into local descriptors, global descriptors, and learning-based descriptors. Fast point feature histograms (FPFH) [24], keypoint voting [25], and the combination of bag of words and point features [6] are state-of-the-art approaches based on local hand-crafted descriptors. FPFH [24] is encoded by computing key points and the underlying surface properties of their neighbors, such as normals and curvature. By reordering the dataset and caching previously computed values, FPFH reduces run time and applies to real-time systems. Wang et al. [25] proposed a new 3D regional descriptor based on gestalt features, in which key points vote over a certain number of neighbors to compute a similarity score. Steder et al. [6] used Normal-Aligned Radial Features to build a dictionary for a bag of words model and realized robust key point and scene matching.

However, local descriptors rely on the acquisition of key points and the calculation of geometric features around key points, which usually lose a lot of information and lead to false matching. Especially for unstructured outdoor objects (e.g., trees), key points from such objects are unreliable.

In contrast, global descriptors are independent of key points and leverage the global point cloud. Multi-view 2D projection (M2DP) [5] is a novel global descriptor built from multi-view 2D projections of a 3D point cloud; it is formed from the left and right singular vectors of each projection's density signature. Kim et al. [7] divided the 3D space into 2D bins and encoded each bin by the maximum height of the points within it. The global descriptor is then represented as a two-dimensional matrix called the scan context, and frames are matched by computing the cosine distance between scan contexts in a column-wise manner. Scan context outperforms existing global descriptors and shows remarkable rotation invariance, which allows it to handle reverse loops. Based on scan context, ref. [26] explored the value of intensity. By integrating both geometric and intensity information, they developed the intensity scan context and showed that intensity can reflect information about different objects. They also proposed a binary search process, which reduces computation time significantly.
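The scan context encoding and column-wise cosine distance described above can be sketched as follows. This is an illustrative numpy version: the bin counts and maximum range follow the defaults reported by Kim et al. [7], point heights are assumed non-negative (above ground), and the column-shift search that provides rotation invariance is omitted for brevity.

```python
import numpy as np

def scan_context(points, num_rings=20, num_sectors=60, max_range=80.0):
    """Encode an N x 3 point cloud as a (rings x sectors) matrix where
    each polar bin stores the maximum point height falling into it."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2)                    # radial distance
    theta = np.arctan2(y, x) + np.pi            # azimuth in [0, 2*pi)
    ring = np.minimum((r / max_range * num_rings).astype(int), num_rings - 1)
    sector = np.minimum((theta / (2 * np.pi) * num_sectors).astype(int),
                        num_sectors - 1)
    sc = np.zeros((num_rings, num_sectors))
    for i, j, h in zip(ring, sector, z):
        sc[i, j] = max(sc[i, j], h)             # keep max height per bin
    return sc

def sc_distance(sc1, sc2):
    """Column-wise cosine distance between two scan contexts:
    1 minus the mean cosine similarity over sector columns."""
    num = np.sum(sc1 * sc2, axis=0)
    den = np.linalg.norm(sc1, axis=0) * np.linalg.norm(sc2, axis=0)
    cos_sim = np.zeros(sc1.shape[1])
    valid = den > 0
    cos_sim[valid] = num[valid] / den[valid]
    return 1.0 - cos_sim.mean()
```

A distance close to 0 indicates near-identical scenes; the segmentation proposed later in this paper changes what enters each bin, not the distance itself.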

In recent years, learning-based methods have gradually been proposed. Segmatch [10] first segments different objects from the original point cloud and then extracts multiple features from each object, such as eigenvalues and shape histograms; a learning-based classifier is then used to match objects across scenes. Kong et al. [11] leveraged semantic segmentation to build a connected graph from the centers of different objects and used a CNN to match scenes by judging the similarity of graphs. Refs. [12,27] proposed the semantic scan context, which encodes each bin with semantic information. However, learning-based methods are usually computationally expensive to train and cannot adapt to various outdoor environments due to the limitations of training data.

Global descriptors show excellent performance but still cannot handle ambiguous environments caused by unstructured objects, and they generate outliers there. In this paper, inspired by [14], we utilize segmentation to remove unstructured objects from scenes while retaining global information and key structured objects. We then apply the segmented point clouds to scan context to construct the segment scan context, which makes different places more distinguishable and effectively prevents perceptual aliasing.

#### **3. Materials and Methods**
