Feature extraction and feature matching are the two key steps in loop closure detection, and they largely determine the accuracy and runtime of the algorithm. Accordingly, in this section we first select appropriate pre-trained lightweight CNN models for intelligent agricultural equipment; the CNNs are used only for image feature extraction, so no further training of the models is required. The image features extracted by the CNNs are then matched with a hash algorithm, and two improved variants of the hash algorithm are used to accelerate matching and compare performance. We also establish the GreenHouse dataset to demonstrate performance. Precision–recall curves, average precision, and average time are used as the performance evaluation metrics for loop closure detection.
2.1. Feature Extraction Model Introduction in Loop Closure Detection
The current mainstream visual SLAM systems still rely on corner points to describe images, which limits their ability to characterize non-corner points, especially in images with fewer corners. In contrast, CNNs offer a more comprehensive approach to feature extraction by leveraging the rich data present in images. Lightweight CNNs, in particular, provide the advantage of compact model structures without sacrificing essential features found in larger CNNs.
The CNN model can be compressed by a variety of techniques, such as pruning, weight sharing, weight quantization, and Huffman coding, but these methods may overlook the significance of redundant features [27]. Alternatively, an efficient architecture can be designed, reducing model parameters and computational effort while preserving the information carried by redundant features [28]. For example, ShuffleNet was constructed from specialized core units that combine grouped pointwise convolutions with channel shuffle to minimize computational complexity and enhance efficiency [29]. GhostNet v2, as another example, generates compact feature maps using cheap linear operations and adopts channel mixing to optimize feature representation, effectively reducing the size of the convolutional network model [30]. VGG19 deepens the convolutional network designs pioneered by LeNet and AlexNet to achieve better performance [31]. Similarly, EfficientNet-B0 replaced the ResNet module with the MBConv module, enhancing the utilization of high-level feature information through a redesigned module architecture [28].
Because efficient architecture models can reduce model parameters and computational workload while minimizing the loss of redundant-feature information, this paper selects four lightweight CNNs (GhostNet, ShuffleNet v2, EfficientNet-B0, and VGG19) for image feature extraction, aiming to explore lightweight approaches that retain the crucial features of larger CNNs while improving loop closure detection performance. The structures of these CNNs are depicted in Figure 1, which illustrates their feature extraction process and their use in loop closure detection. Each model, including GhostNet, ShuffleNet v2, and EfficientNet-B0, employs a distinct feature reuse strategy to achieve efficiency and effectiveness in loop closure detection. The solid arrows in the figure represent the data flow within the parts of the CNN models used in this paper, while the dashed arrows indicate the models' original frameworks.
2.2. Feature Matching in Loop Closure Detection with CNNs
A visual bag-of-words (BoW) model based on manually designed features is the most commonly used solution for loop closure detection [32,33,34,35,36,37,38,39]. This method extracts feature points from images using algorithms such as SIFT, SURF, or ORB, then clusters these points and their descriptors into multiple visual words, so that an image can be mapped through the BoW vocabulary to a feature vector. Here, we adopt a BoW model based on SIFT feature points and use cosine similarity to measure image similarity.
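As a minimal sketch of this similarity step (the vocabulary size and histogram values below are illustrative, not taken from the paper), the cosine similarity between two BoW histograms can be computed as:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words histograms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Two images whose SIFT descriptors fall into a 5-word vocabulary
# (illustrative counts per visual word).
hist_query = [3, 0, 1, 2, 0]
hist_candidate = [2, 0, 1, 3, 0]
print(round(cosine_similarity(hist_query, hist_candidate), 3))  # → 0.929
```

A similarity close to 1 indicates that the two images share a very similar distribution of visual words and are therefore loop-closure candidates.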
In agricultural settings, the abundance of local feature points and the similarity of scene elements render traditional methods less practical than CNN-based ones. However, CNN-extracted feature vectors often suffer from high dimensionality, necessitating methods such as the random-hyperplane locality-sensitive hashing (RHLSH) algorithm for dimensionality reduction and initial retrieval of image feature vectors. RHLSH partitions the high-dimensional space using random hyperplanes and organizes vectors according to their positions relative to those hyperplanes [40]. As illustrated in Figure 2, the CNN-extracted feature map is reshaped into a feature vector and projected onto randomly generated hyperplanes via a family of hash functions, with the result represented as a Hamming code. This approach effectively represents the high-dimensional feature map using hash codes over randomly generated, relatively low-dimensional hyperplanes.
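A minimal sketch of this hashing step, assuming sign-of-projection bits and an illustrative hyperplane count of k = 8 (the paper does not fix these values here):

```python
import numpy as np

def rhlsh_hash(v, hyperplanes):
    """Map a feature vector to k sign bits, one per random hyperplane."""
    return tuple(int(np.dot(h, v) >= 0) for h in hyperplanes)

rng = np.random.default_rng(0)
d, k = 128, 8                              # feature dim, hash length (illustrative)
hyperplanes = rng.standard_normal((k, d))  # normals drawn from N(0, I)

v = rng.standard_normal(d)
code = rhlsh_hash(v, hyperplanes)
assert len(code) == k
assert code == rhlsh_hash(2.0 * v, hyperplanes)   # sign bits ignore scale
```

Each bit records on which side of one hyperplane the vector lies, so nearby vectors tend to share most hash bits.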
In high-dimensional space, a normal vector randomly sampled from the standard multivariate normal distribution $\mathcal{N}(\mathbf{0}, I)$ points in any direction with equal probability, ensuring uniform sampling [41]. Consequently, projecting onto multiple hyperplanes and calculating matching scores can improve the matching accuracy of feature maps. The workflow is as follows: after an image is processed by the CNN and its hash code is generated, the feature maps in the corresponding hash bucket are tallied. Each hash code under each family of hyperplanes corresponds to a distinct hash bucket. Occasionally, multiple feature maps may reside in one hash bucket, meaning that a single hash code may correspond to several feature maps; the feature maps within the bucket must therefore be analyzed statistically. Ultimately, the feature map with the highest score exceeding a preset threshold is deemed successfully matched. If no score reaches the threshold, matching fails, indicating that loop closure has not occurred. This entire process is depicted in Figure 3.
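The tally-and-threshold workflow above can be sketched as a toy multi-table index; the table count, hash length, and vote threshold below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from collections import defaultdict

class RHLSHIndex:
    """Toy multi-table RHLSH index for the bucket-voting workflow."""
    def __init__(self, dim, k=6, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        # one family of k random hyperplanes per hash table
        self.planes = [rng.standard_normal((k, dim)) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _code(self, v, t):
        return tuple(int(p @ v >= 0) for p in self.planes[t])

    def add(self, idx, v):
        for t in range(len(self.tables)):
            self.tables[t][self._code(v, t)].append(idx)

    def query(self, v, threshold=2):
        votes = defaultdict(int)
        for t in range(len(self.tables)):
            for idx in self.tables[t].get(self._code(v, t), []):
                votes[idx] += 1                      # tally per hash bucket
        if not votes:
            return None                              # no loop closure
        best, score = max(votes.items(), key=lambda kv: kv[1])
        return best if score >= threshold else None

rng = np.random.default_rng(1)
index = RHLSHIndex(dim=16)
feat = rng.standard_normal(16)
index.add(0, feat)
assert index.query(feat) == 0     # the stored map wins in every table
```

A stored feature map is matched only when its vote count across tables exceeds the preset threshold, mirroring the score-versus-threshold decision described above.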
2.3. Accelerated Feature Matching with Multi-Probe RHLSH
Increasing the number of hash function families and hash tables can improve search accuracy and recall, but it also increases memory usage. To mitigate this, the search range within a single hash table can be expanded instead. Multi-probe RHLSH is such an exploration method, and it improves search recall to some extent. Two key strategies for expanding the search range are Step-Wise Probing RHLSH (SWP-RHLSH) and Query-Directed Probing RHLSH (QDP-RHLSH).
For SWP-RHLSH, the Boolean hash value of the feature vector allows the search range to be expanded step by step according to the number of bits by which candidate hash values differ from the query. Because the set of feature vectors grows dynamically during loop closure detection, a linear scan is first employed to enumerate the hash bucket perturbations within a specified range, expediting the search.
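The step-wise expansion by differing bits can be sketched as follows, where `max_flips` bounds the Hamming radius (an illustrative parameter):

```python
from itertools import combinations

def swp_probe_codes(code, max_flips=2):
    """Yield hash codes ordered by increasing Hamming distance from
    `code`; `max_flips` bounds the probing radius (illustrative)."""
    k = len(code)
    yield tuple(code)                    # distance 0: the original bucket
    for r in range(1, max_flips + 1):
        for positions in combinations(range(k), r):
            probed = list(code)
            for p in positions:
                probed[p] ^= 1           # flip the selected hash bits
            yield tuple(probed)

codes = list(swp_probe_codes((1, 0, 1, 1), max_flips=1))
# → [(1, 0, 1, 1), (0, 0, 1, 1), (1, 1, 1, 1), (1, 0, 0, 1), (1, 0, 1, 0)]
```

Probing buckets in this order visits the query's own bucket first and then progressively less similar buckets, which is what trades extra probes for recall within a single hash table.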
For QDP-RHLSH, the query's position relative to each random hyperplane within the same hash table is used to refine the search probability: hash buckets with a higher likelihood of containing the nearest-neighbor feature vectors are probed first, reducing the exploration of incorrect feature vectors. For a given sequence of perturbation vectors $\Delta = (\delta_1, \delta_2, \ldots, \delta_k)$ with $\delta_j \in \{0, 1\}$, an evaluation probability function with respect to the random hyperplane can be defined as in (2). When $\Delta = \mathbf{0}$, indicating no perturbation, the probability of collision is given by (1) [41]:

$p(\theta) = 1 - \dfrac{\theta}{\pi}$  (1)

where $\Phi(\cdot)$ is the standard normal distribution function; $\boldsymbol{a}$ is the normal vector of the random hyperplane; $\theta$ is the angle between the two nearest-neighbor feature vectors, usually taking values in $[0, \pi]$; and $p_{j}^{(i)}(\delta_j)$, the function in (2), is defined as the evaluation probability of the hash code produced by the $j$-th hash function for the $i$-th feature vector to be matched after the perturbation is added.
Together with the shift transform (3) and the expand transform (4) [42], a maximum heap weighted by the evaluation probabilities can be constructed to obtain the perturbation vectors with the top $M$ largest weights, where each perturbation vector is derived from a perturbation set. Taking $k = 4$ as an example: suppose the evaluation probabilities sorted in descending order correspond to the positions $(z_1, z_2, z_3, z_4)$; for the perturbation set $\{1, 4\}$, the first and fourth positions after descending sorting, $z_1$ and $z_4$, are chosen as the perturbation positions, and the resulting perturbation vector flips exactly the hash bits at those two positions. The shift transform does not act on the empty set, and each application adds 1 to the value of the largest element of the perturbation set, while the expand transform adds to the set a new element whose value is the largest element plus 1. Because the number of perturbation bits is limited to $k$, the two operations eventually terminate.
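Under the assumption that a perturbation set's weight is the product of the sorted evaluation probabilities of its member positions (a concrete choice made for this sketch; the max-heap mechanics are the point), the heap-driven generation of the top-M perturbation sets via shift and expand can be sketched as:

```python
import heapq

def top_m_perturbations(probs, m):
    """Return the m perturbation sets with the largest weights over
    positions 1..k, where probs[j-1] is the evaluation probability of
    the j-th position after descending sorting (probs sorted descending).
    Weight of a set = product of its members' probabilities (assumed)."""
    k = len(probs)

    def weight(s):
        w = 1.0
        for j in s:
            w *= probs[j - 1]
        return w

    heap = [(-weight((1,)), (1,))]        # max-heap via negated weights
    seen = {(1,)}
    out = []
    while heap and len(out) < m:
        _, s = heapq.heappop(heap)
        out.append(set(s))
        a = max(s)
        if a < k:
            # shift transform (3): bump the largest element by 1
            shifted = tuple(sorted((set(s) - {a}) | {a + 1}))
            # expand transform (4): add a new element equal to max + 1
            expanded = tuple(sorted(set(s) | {a + 1}))
            for cand in (shifted, expanded):
                if cand not in seen:
                    seen.add(cand)
                    heapq.heappush(heap, (-weight(cand), cand))
    return out
```

Starting from the singleton set {1}, each popped set spawns its shift and expand successors, so the heap enumerates perturbation sets in non-increasing weight order without materializing all 2^k candidates.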
In a practical loop closure detection system, the number of stored hash buckets grows dynamically as new image features arrive, so when probing the first $M$ hash buckets a large proportion of the probed buckets may not yet exist, wasting significant search time. Therefore, the probing count $M$ should not be a fixed value but a piecewise function of the number of hash buckets: we set $M = 1000$ when the number of hash buckets exceeds 500, and $M = 1500$ otherwise.
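The segmented probing count can be written directly; the thresholds below are the settings stated above:

```python
def probe_count(n_buckets):
    """Segmented probing count M as a function of the current number of
    hash buckets (thresholds taken from the text above)."""
    return 1000 if n_buckets > 500 else 1500
```

Early in a run, when few buckets exist, more probes are allowed; once the table is well populated, the count drops to limit wasted probes of non-existent buckets.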
2.4. Datasets and Pre-Processing
We utilized the TUM dataset and the greenhouse scene dataset captured with the D435i depth camera, as presented in
Table 1. The TUM dataset, sourced from the computer vision group at the Technical University of Munich, Germany, is commonly employed for RGB-D SLAM research. This dataset provides coordinate files of camera motion trajectories detected by high-precision sensors. On the other hand, the greenhouse scene dataset was gathered on 19 February 2021, at 10:00 a.m. in the plant factory of South China Agricultural University, located in Guangzhou, Guangdong Province.
The TUM dataset contains a variety of objects, such as office desks, chairs, computer equipment, and robotic arm models, providing abundant texture and structure for image feature extraction. Additionally, the camera trajectory in this dataset forms a large circular closed trajectory with overlap at the initial and final points. This setup mirrors conditions often found in agricultural scenes, characterized by rich texture structures (as depicted in
Figure 4a).
However, the TUM dataset does not provide ground truth for evaluating loop closure detection algorithms; instead, it offers camera motion trajectory coordinate files recorded by high-precision sensors. To associate the pose coordinate files with the image data, the scripting tool provided by TUM was used, and a pose coordinate was matched to an image when their time difference was within 0.02 s. The occurrence of loop closure was then determined by calculating the pose error between any two frames of the matched camera poses. Because positional changes between adjacent images are relatively small, pose errors between a frame and its 150 neighboring images are disregarded. The pose error is calculated as in Equation (5).
$E_{ij} = \left\| T_i^{-1} T_j - I \right\|$  (5)

where $T$ is the camera pose; the subscripts $i, j$ are the image serial numbers, with $|i - j| > 150$; and $I$ is the identity matrix.
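Assuming the Frobenius norm as the concrete matrix norm in Equation (5) and an illustrative error threshold (the paper does not restate either here), the pose-error test can be sketched as:

```python
import numpy as np

def pose_error(T_i, T_j):
    """Eq. (5): deviation of T_i^{-1} T_j from the identity; the
    Frobenius norm is an assumed concrete choice of matrix norm."""
    return np.linalg.norm(np.linalg.inv(T_i) @ T_j - np.eye(4))

def is_loop(T_i, T_j, i, j, eps=0.1, min_gap=150):
    """A frame pair closes a loop if it is far enough apart in the
    sequence and its pose error is below eps (eps is illustrative)."""
    return abs(i - j) > min_gap and pose_error(T_i, T_j) < eps
```

When two poses coincide, $T_i^{-1} T_j$ is the identity and the error is exactly zero, so any sufficiently separated pair of near-identical poses is reported as a loop closure.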
The greenhouse scene dataset was collected at 10:00 a.m., when the light intensity is high. The dataset includes a variety of green vegetables that have been planted on cultivators, blank cultivators that have not been planted, automated agricultural equipment, and other common elements of agricultural production environments, and it likewise contains a large circular closed trajectory (as shown in Figure 4b). The GreenHouse dataset captures authentic greenhouse agricultural scenes using the D435i depth camera. Cameras are typically categorized as monocular, stereo, or RGB-D. Monocular and stereo cameras require depth to be estimated algorithmically, whereas RGB-D cameras measure depth directly and therefore exhibit the highest average depth accuracy of the three types. Accordingly, the ORB-SLAM2 system is employed to compute the D435i camera's motion trajectory in the greenhouse scene dataset, which serves as the reference trajectory, and Formula (5) is applied to derive the ground truth for loop closure detection.
The loop closure detection ground truth is saved in the form of a matrix. If the $i$-th image and the $j$-th image constitute a loop closure, the value of the ground truth matrix at $(i, j)$ is 1; otherwise it is 0.
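The ground-truth matrix construction can be sketched as follows, again assuming the Frobenius norm in Formula (5) and an illustrative error threshold `eps`:

```python
import numpy as np

def ground_truth_matrix(poses, eps=0.1, min_gap=150):
    """Build the binary loop-closure ground truth: entry (i, j) is 1 when
    frames i and j close a loop, 0 otherwise. The Frobenius norm and eps
    are illustrative choices; min_gap skips the 150 neighboring images."""
    n = len(poses)
    gt = np.zeros((n, n), dtype=np.uint8)
    for i in range(n):
        for j in range(i + min_gap + 1, n):
            err = np.linalg.norm(np.linalg.inv(poses[i]) @ poses[j] - np.eye(4))
            if err < eps:
                gt[i, j] = gt[j, i] = 1      # the matrix is symmetric
    return gt
```

Only pairs more than `min_gap` frames apart are tested, matching the rule that pose errors of the 150 neighboring images are disregarded.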